Regularization by the sum of singular values, also referred to as the trace norm, is a
popular technique for estimating low rank rectangular matrices. In this paper, we extend 
some of the consistency results of the Lasso to provide necessary and sufficient conditions 
for rank consistency of trace norm minimization with the square loss. We also provide 
an adaptive version that is rank consistent even when the necessary condition for the
non-adaptive version is not fulfilled.

1. Introduction

In recent years, regularization by various non-Euclidean norms has seen considerable interest.
In particular, in the context of linear supervised learning, norms such as the $\ell_1$-norm
may induce sparse loading vectors, i.e., loading vectors with low cardinality or $\ell_0$-norm.
Such regularization schemes, also known as the Lasso (Tibshirani, 1994) for least-square
regression, come with efficient path following algorithms (Efron et al., 2004). Moreover, 
recent work has studied conditions under which such procedures consistently estimate the 
sparsity pattern of the loading vector (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006). 

When learning on rectangular matrices, the rank is a natural extension of the cardinality, 
and the sum of singular values, also known as the trace norm or the nuclear norm, is the 
natural extension of the $\ell_1$-norm; indeed, as the $\ell_1$-norm is the convex envelope of the
$\ell_0$-norm on the unit ball (i.e., the largest lower bounding convex function) (Boyd and
Vandenberghe, 2003), the trace norm is the convex envelope of the rank over the unit ball 
of the spectral norm (Fazel et al., 2001). In practice, it leads to low rank solutions (Fazel 
et al., 2001, Srebro et al., 2005) and has seen recent increased interest in the context of 
collaborative filtering (Srebro et al., 2005), multi-task learning (Abernethy et al., 2006, 
Argyriou et al., 2007) or classification with multiple classes (Amit et al., 2007). 

In this paper, we consider the rank consistency of trace norm regularization with the 
square loss, i.e., if the data were actually generated by a low-rank matrix, will the matrix 
and its rank be consistently estimated? In Section 4, we provide necessary and sufficient 
conditions for the rank consistency that are extensions of corresponding results for the 
Lasso (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006) and the group Lasso (Bach,
2007). We do so under two sets of sampling assumptions detailed in Section 3.2: a full
i.i.d. assumption and a non-i.i.d. assumption which is natural in the context of collaborative
filtering. 

As for the Lasso and the group Lasso, the necessary condition implies that such proce- 
dures do not always estimate the rank correctly; following the adaptive version of the Lasso 
and group Lasso (Zou, 2006), we design an adaptive version to achieve $n^{-1/2}$-consistency
and rank consistency, with no consistency conditions. Finally, in Section 6, we present a 
smoothing approach to convex optimization with the trace norm, while in Section 6.3, we 
show simulations on toy examples to illustrate the consistency results. 

2. Notations 

In this paper we consider various norms on vectors and matrices. On vectors $x$ in $\mathbb{R}^d$,
we always consider the Euclidean norm, i.e., $\|x\| = (x^\top x)^{1/2}$. On rectangular matrices in
$\mathbb{R}^{p \times q}$, however, we consider several norms, based on singular values (Stewart and Sun, 1990):
the spectral norm $\|M\|_2$ is the largest singular value (defined as $\|M\|_2 = \sup_{x \in \mathbb{R}^q,\, x \neq 0} \|Mx\| / \|x\|$);
the trace norm (or nuclear norm) $\|M\|_*$ is the sum of singular values, and the Frobenius
norm $\|M\|_F$ is the $\ell_2$-norm of singular values (also defined as $\|M\|_F = (\mathrm{tr}\, M^\top M)^{1/2}$). In
Appendices A and B, we review and derive relevant tools and results regarding perturbation
of singular values as well as the trace norm. 

Given a matrix $M \in \mathbb{R}^{p \times q}$, $\mathrm{vec}(M)$ denotes the vector in $\mathbb{R}^{pq}$ obtained by stacking its
columns into a single vector, and $A \otimes B$ denotes the Kronecker product between matrices
$A \in \mathbb{R}^{p_1 \times q_1}$ and $B \in \mathbb{R}^{p_2 \times q_2}$, defined as the matrix in $\mathbb{R}^{p_1 p_2 \times q_1 q_2}$ made of blocks of size
$p_2 \times q_2$ equal to $a_{ij} B$. We make constant use of the following identities: $(B^\top \otimes A)\, \mathrm{vec}(X) =
\mathrm{vec}(AXB)$ and $\mathrm{vec}(uv^\top) = v \otimes u$. For more details and properties, see Golub and Loan
(1996) and Magnus and Neudecker (1998). We also use the notation $\Sigma W$ for $\Sigma \in \mathbb{R}^{pq \times pq}$
and $W \in \mathbb{R}^{p \times q}$ to denote the matrix in $\mathbb{R}^{p \times q}$ such that $\mathrm{vec}(\Sigma W) = \Sigma\, \mathrm{vec}(W)$ (note the
potential confusion with the ordinary product $\Sigma W$ when $\Sigma$ is a matrix with $p$ columns).
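The two identities above can be checked numerically. The following is a minimal sketch using NumPy; the helper `vec` and the matrix sizes are illustrative choices, not part of the paper.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))
u = rng.standard_normal(3)
v = rng.standard_normal(5)

# (B^T kron A) vec(X) == vec(A X B)
lhs = np.kron(B.T, A) @ vec(X)
rhs = vec(A @ X @ B)
identity_1_holds = np.allclose(lhs, rhs)

# vec(u v^T) == v kron u
identity_2_holds = np.allclose(vec(np.outer(u, v)), np.kron(v, u))
```

Both identities rely on the column-stacking convention for `vec`; with row-major flattening the roles of the two Kronecker factors would be swapped.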

We also use the following standard asymptotic notations: a random variable $Z_n$ is said to
be of order $O_p(a_n)$ if for any $\eta > 0$, there exists $M > 0$ such that $\sup_n P(|Z_n| > M a_n) < \eta$.
Moreover, $Z_n$ is said to be of order $o_p(a_n)$ if $Z_n / a_n$ converges to zero in probability, i.e., if
for any $\eta > 0$, $P(|Z_n| \geq \eta a_n)$ converges to zero. See Van der Vaart (1998) and Shao (2003)
for further definitions and properties of asymptotics in probability. 

Finally, we use the following two conventions: lowercase for vectors and uppercase for 
matrices, while bold fonts are reserved for population quantities. 

3. Trace norm minimization 

We consider the problem of predicting a real random variable $z$ as a linear function of a
matrix $M \in \mathbb{R}^{p \times q}$, where $p$ and $q$ are two fixed strictly positive integers. Throughout this
paper, we assume that we are given $n$ observations $(M_i, z_i)$, $i = 1, \dots, n$, and we consider
the following optimization problem with the square loss:

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2n} \sum_{i=1}^n (z_i - \mathrm{tr}\, W^\top M_i)^2 + \lambda_n \|W\|_*, \qquad (1)$$

where $\|W\|_*$ denotes the trace norm of $W$.

3.1 Special cases 

Regularization by the trace norm has numerous applications (see, e.g., Recht et al. (2007) 
for a review); in this paper, we are particularly interested in the following two situations: 

Lasso and group Lasso When $x_i \in \mathbb{R}^m$, we can define $M_i = \mathrm{Diag}(x_i) \in \mathbb{R}^{m \times m}$ as the
diagonal matrix with $x_i$ on the diagonal. In this situation the minimization of problem
Eq. (1) must lead to diagonal solutions (indeed the minimum trace norm matrix with fixed
diagonal is the corresponding diagonal matrix, which is a consequence of Lemma 20 and
Proposition 21) and for a diagonal matrix the trace norm is simply the $\ell_1$ norm of the
diagonal. Once we have derived our consistency conditions, we check in Section 4.5 that 
they actually lead to the known ones for the Lasso (Yuan and Lin, 2007, Zhao and Yu, 
2006, Zou, 2006). 

We can also see the group Lasso as a special case; indeed, if $x_{ij} \in \mathbb{R}^{d_j}$ for $j = 1, \dots, m$,
$i = 1, \dots, n$, then we define $M_i \in \mathbb{R}^{(\sum_{j=1}^m d_j) \times m}$ as the block diagonal matrix (with non
square blocks) with diagonal blocks $x_{ij}$, $j = 1, \dots, m$. Similarly, the optimal $W$ must share
the same block-diagonal form, and its singular values are exactly the norms of each block, 
i.e., the trace norm is indeed the sum of the norms of each group. We also get back results 
from Bach (2007) in Section 4.5. 
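As a small numerical illustration of these two special cases (a sketch only; the group sizes and random values are hypothetical), one can check that the trace norm of a diagonal matrix is the $\ell_1$ norm of its diagonal, and that the singular values of a block-diagonal matrix with one column per group are the norms of the blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Lasso case: W = Diag(w), trace norm equals the l1-norm of w.
w = rng.standard_normal(5)
trace_norm_diag = np.linalg.svd(np.diag(w), compute_uv=False).sum()
l1_norm = np.abs(w).sum()

# Group Lasso case: block-diagonal W with one column per group.
dims = [2, 3, 1]                        # hypothetical group sizes d_j
blocks = [rng.standard_normal(d) for d in dims]
W = np.zeros((sum(dims), len(dims)))
row = 0
for j, block in enumerate(blocks):
    W[row:row + len(block), j] = block  # j-th diagonal block is the vector w_j
    row += len(block)

singular_values = np.sort(np.linalg.svd(W, compute_uv=False))
block_norms = np.sort([np.linalg.norm(b) for b in blocks])
```

The columns of such a `W` have disjoint supports and are therefore orthogonal, which is why the singular values reduce to the column norms.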

Note that the Lasso and group Lasso can be seen as special cases where the singular 
vectors are fixed. However, the main difficulty in analyzing trace norm regularization, as
well as the main reason for its use, is that singular vectors are not fixed and those can often
be seen as implicit features learned by the estimation procedure (Srebro et al., 2005). In 
this paper we derive consistency results about the value and numbers of such features. 

Collaborative filtering and low-rank completion Another natural application is collaborative
filtering where two types of attributes $x$ and $y$ are observed and we consider
bilinear forms in $x$ and $y$, which can be written as a linear form in $M = xy^\top$ (thus it corresponds
to situations where all matrices $M_i$ have rank one). In this setting, the matrices $M_i$
are not usually i.i.d. but exhibit a statistical dependence structure outlined in Section 3.2.
A special case here is when no attributes are observed and we simply wish to complete
a partially observed matrix (Srebro et al., 2005, Abernethy et al., 2006). The results
presented in this paper do not immediately apply because the dimension of the estimated 
matrix may grow with the number of observed entries and this situation is out of the scope 
of this paper. 

3.2 Assumptions 

We make the following assumptions on the sampling distributions of $M \in \mathbb{R}^{p \times q}$ for the
problem in Eq. (1). We denote $\hat{\Sigma}_{mm} = \frac{1}{n} \sum_{i=1}^n \mathrm{vec}(M_i)\, \mathrm{vec}(M_i)^\top \in \mathbb{R}^{pq \times pq}$, and we
consider the following assumptions:

(A1) Given $M_i$, $i = 1, \dots, n$, the $n$ values $z_i$ are i.i.d. and there exists $\mathbf{W} \in \mathbb{R}^{p \times q}$ such that
for all $i$, $E(z_i | M_1, \dots, M_n) = \mathrm{tr}\, \mathbf{W}^\top M_i$ and $\mathrm{var}(z_i | M_1, \dots, M_n)$ is a strictly positive
constant $\sigma^2$. $\mathbf{W}$ is not equal to zero and does not have full rank.

(A2) There exists an invertible matrix $\Sigma_{mm} \in \mathbb{R}^{pq \times pq}$ such that $E \|\hat{\Sigma}_{mm} - \Sigma_{mm}\|_F^2 = O(\zeta_n)$
for a certain sequence $\zeta_n$ that tends to zero.

(A3) The random variable $n^{-1/2} \sum_{i=1}^n \varepsilon_i\, \mathrm{vec}(M_i)$ is converging in distribution to a normal
distribution with mean zero and covariance matrix $\sigma^2 \Sigma_{mm}$.

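Assumption (A1) states that given the input matrices $M_i$, $i = 1, \dots, n$, we have a linear
prediction model, where the loading matrix $\mathbf{W}$ is non trivial and rank-deficient, the goal
being to estimate this rank (as well as the matrix itself). We denote by $\mathbf{W} = \mathbf{U}\, \mathrm{Diag}(\mathbf{s})\, \mathbf{V}^\top$
its singular value decomposition, with $\mathbf{U} \in \mathbb{R}^{p \times r}$, $\mathbf{V} \in \mathbb{R}^{q \times r}$, where $r \in (0, \min\{p, q\})$ denotes
the rank of $\mathbf{W}$. We also denote by $\mathbf{U}_\perp \in \mathbb{R}^{p \times (p-r)}$ and $\mathbf{V}_\perp \in \mathbb{R}^{q \times (q-r)}$ any orthogonal
complements of $\mathbf{U}$ and $\mathbf{V}$.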

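We denote $\varepsilon_i = z_i - \mathrm{tr}\, \mathbf{W}^\top M_i$, $\hat{\Sigma}_{Mz} = \frac{1}{n} \sum_{i=1}^n z_i M_i \in \mathbb{R}^{p \times q}$, and $\hat{\Sigma}_{M\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i M_i =
\hat{\Sigma}_{Mz} - \hat{\Sigma}_{mm} \mathbf{W} \in \mathbb{R}^{p \times q}$. We may then rewrite Eq. (1) as

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W)^\top \hat{\Sigma}_{mm}\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top \hat{\Sigma}_{Mz} + \lambda_n \|W\|_*, \qquad (2)$$

or, equivalently,

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W - \mathbf{W})^\top \hat{\Sigma}_{mm}\, \mathrm{vec}(W - \mathbf{W}) - \mathrm{tr}\, W^\top \hat{\Sigma}_{M\varepsilon} + \lambda_n \|W\|_*. \qquad (3)$$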
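The rewriting of Eq. (1) as Eq. (2) drops a term that does not depend on $W$, namely $\frac{1}{2n} \sum_i z_i^2$. A minimal numerical sketch of this equivalence, with hypothetical sizes and random data (the function names are illustrative), is:

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
n, p, q, lam = 50, 3, 4, 0.1
Ms = rng.standard_normal((n, p, q))           # observations M_i
z = rng.standard_normal(n)                    # responses z_i

vecs = np.stack([vec(M) for M in Ms])         # n x pq matrix of vec(M_i)
Sigma_mm_hat = vecs.T @ vecs / n              # (1/n) sum vec(M_i) vec(M_i)^T
Sigma_Mz_hat = np.einsum("i,ijk->jk", z, Ms) / n   # (1/n) sum z_i M_i

def trace_norm(W):
    return np.linalg.svd(W, compute_uv=False).sum()

def objective_eq1(W):
    residuals = z - np.einsum("jk,ijk->i", W, Ms)   # z_i - tr(W^T M_i)
    return 0.5 * np.mean(residuals ** 2) + lam * trace_norm(W)

def objective_eq2(W):
    w = vec(W)
    return 0.5 * w @ Sigma_mm_hat @ w - np.sum(W * Sigma_Mz_hat) + lam * trace_norm(W)

constant = 0.5 * np.mean(z ** 2)   # the W-independent term dropped in Eq. (2)
W_test = rng.standard_normal((p, q))
gap = objective_eq1(W_test) - objective_eq2(W_test)
```

Since the two objectives differ by a constant, they share the same minimizers.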

The sampling assumptions (A2) and (A3) may seem restrictive, but they are satisfied 
in the following two natural situations. The first situation corresponds to a classical full 
i.i.d. problem, where the pairs $(z_i, M_i)$ are sampled i.i.d.:

Lemma 1 Assume (A1). If the matrices $M_i$ are sampled i.i.d., $z$ and $M$ have finite fourth
order moments, and $E\{\mathrm{vec}(M)\, \mathrm{vec}(M)^\top\}$ is invertible, then (A2) and (A3) are satisfied
with $\zeta_n = n^{-1/2}$.

Note the further refinement when for each $i$, $M_i = x_i y_i^\top$ and $x_i$ and $y_i$ are independent,
which implies that $\Sigma_{mm}$ is factorized as a Kronecker product, of the form $\Sigma_{yy} \otimes \Sigma_{xx}$, where
$\Sigma_{xx}$ and $\Sigma_{yy}$ are the (invertible) second order moment matrices of $x$ and $y$.

The second situation corresponds to a collaborative filtering situation where two types
of attributes are observed, e.g., $x$ and $y$, and for every pair $(x, y)$ we wish to predict $z$ as a
bilinear form in $x$ and $y$: we first sample $n_x$ values for $x$, and $n_y$ values for $y$, and we select
uniformly at random a subset of $n \leq n_x n_y$ observations from the $n_x n_y$ possible pairs. The
following lemma, proved in Appendix C.1, shows that this set-up satisfies our assumptions:

Lemma 2 Assume (A1). Assume moreover that $n_x$ values $x_1, \dots, x_{n_x}$ are sampled i.i.d.
and $n_y$ values $y_1, \dots, y_{n_y}$ are also sampled i.i.d. from distributions with finite fourth order
moments and invertible second order moment matrices $\Sigma_{xx}$ and $\Sigma_{yy}$. Assume also that a
random subset of size $n$ of pairs $(i_k, j_k)$ in $\{1, \dots, n_x\} \times \{1, \dots, n_y\}$ is sampled uniformly.
Then, if $n_x$, $n_y$ and $n$ tend to infinity, (A2) and (A3) are satisfied with $\Sigma_{mm} = \Sigma_{yy} \otimes \Sigma_{xx}$
and $\zeta_n = n^{-1/2} + n_x^{-1/2} + n_y^{-1/2}$.

3.3 Optimality conditions

From the expression of the subdifferential of the trace norm in Proposition 21 (Appendix B), 
we can identify the optimality condition for problem in Eq. (1), that we will constantly use 
in the paper: 

Proposition 3 The matrix $W$ with singular value decomposition $W = U\, \mathrm{Diag}(s)\, V^\top$ (with
strictly positive singular values $s$) is optimal for the problem in Eq. (1) if and only if

$$\hat{\Sigma}_{mm} W - \hat{\Sigma}_{Mz} + \lambda_n U V^\top + N = 0, \qquad (4)$$

with $U^\top N = 0$, $N V = 0$ and $\|N\|_2 \leq \lambda_n$.

This implies notably that $W$ and $\hat{\Sigma}_{mm} W - \hat{\Sigma}_{Mz}$ have simultaneous singular value decompositions,
and that the singular values of $\hat{\Sigma}_{mm} W - \hat{\Sigma}_{Mz}$ are at most $\lambda_n$, and exactly equal to $\lambda_n$ for the
corresponding strictly positive singular values of $W$. Note that when all matrices are diagonal
(the Lasso case), we obtain the usual optimality conditions (see also Recht et al. (2007)
for further discussions). 
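The conditions of Proposition 3 can be checked numerically in a simple case. The sketch below assumes $\hat{\Sigma}_{mm} = I$ (an illustrative simplification, not the general setting of the paper), for which the minimizer of Eq. (2) is known to be obtained by soft-thresholding the singular values of $\hat{\Sigma}_{Mz}$; the matrix `Q` stands in for $\hat{\Sigma}_{Mz}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, lam = 4, 3, 1.0
Q = rng.standard_normal((p, q)) * 2.0         # plays the role of Sigma_Mz_hat

# With Sigma_mm_hat = I, the minimizer soft-thresholds the singular values of Q.
U_full, s_full, Vt_full = np.linalg.svd(Q, full_matrices=False)
keep = s_full > lam
U, V = U_full[:, keep], Vt_full[keep].T
s = s_full[keep] - lam
W = U @ np.diag(s) @ V.T

# Residual N defined by Eq. (4): Sigma W - Sigma_Mz + lam U V^T + N = 0.
N = -(W - Q + lam * U @ V.T)

cond_left = np.allclose(U.T @ N, 0.0)                 # U^T N = 0
cond_right = np.allclose(N @ V, 0.0)                  # N V = 0
cond_norm = np.linalg.norm(N, 2) <= lam + 1e-12       # ||N||_2 <= lam
```

Here `N` collects the singular components of `Q` that were thresholded to zero, which is why it is orthogonal to `U` and `V` and has spectral norm at most `lam`.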

4. Consistency results 

We consider two types of consistency: the regular consistency, i.e., we want the probability
$P(\|\hat{W} - \mathbf{W}\| \geq \epsilon)$ to tend to zero as $n$ tends to infinity, for all $\epsilon > 0$; and
the rank consistency, i.e., we want $P(\mathrm{rank}(\hat{W}) \neq \mathrm{rank}(\mathbf{W}))$ to tend to zero as $n$ tends
to infinity. As for the Lasso, the consistency depends on the
decay of the regularization parameter. Essentially, we obtain the following results:

a) if $\lambda_n$ does not tend to zero, then the trace norm estimate $\hat{W}$ is not consistent;

b) if $\lambda_n$ tends to zero faster than $n^{-1/2}$, then the estimate is consistent and its error is
$O_p(n^{-1/2})$ while it is not rank-consistent with probability tending to one (see Section 4.1);

c) if $\lambda_n$ tends to zero exactly at rate $n^{-1/2}$, then the estimator is consistent with error
$O_p(n^{-1/2})$ but the probability of estimating the correct rank is converging to a limit
in $(0, 1)$ (see Section 4.2);

d) if $\lambda_n$ tends to zero more slowly than $n^{-1/2}$, then the estimate is consistent with error
$O_p(\lambda_n)$ and its rank consistency depends on specific consistency conditions detailed
in Section 4.3.

The following sections will look at each of these cases, and state precise theorems. We then 
consider some special cases, i.e., factored second-order moments and implications for the 
special cases of the Lasso and group Lasso. 

The first proposition (proved in Appendix C.2) considers the case where the regularization
parameter $\lambda_n$ is converging to a certain limit $\lambda_0$. When this limit is zero, we obtain
regular consistency (Corollary 5 below), while if $\lambda_0 > 0$, then $\hat{W}$ tends in probability to a
limit which is always different from $\mathbf{W}$:

Proposition 4 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If
$\lambda_n$ tends to a limit $\lambda_0 \geq 0$, then $\hat{W}$ converges in probability to the unique global minimizer
of

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W - \mathbf{W})^\top \Sigma_{mm}\, \mathrm{vec}(W - \mathbf{W}) + \lambda_0 \|W\|_*.$$

Corollary 5 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If
$\lambda_n$ tends to zero, then $\hat{W}$ converges in probability to $\mathbf{W}$.

We now consider finer results when $\lambda_n$ tends to zero at certain rates, slower or faster than
$n^{-1/2}$, or exactly at rate $n^{-1/2}$.



4.1 Fast decay of regularization parameter 

The following proposition, which is a consequence of standard results in M-estimation (Shao,
2003, Van der Vaart, 1998), considers the case where $n^{1/2} \lambda_n$ is tending to zero, where we
obtain that $\hat{W}$ is asymptotically normal with mean $\mathbf{W}$ and covariance matrix $n^{-1} \sigma^2 \Sigma_{mm}^{-1}$,
i.e., for fast decays, the first order expansion is the same as the one with no regularization
parameter:

Proposition 6 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If
$n^{1/2} \lambda_n$ tends to zero, $n^{1/2} (\hat{W} - \mathbf{W})$ is asymptotically normal with mean zero and covariance
matrix $\sigma^2 \Sigma_{mm}^{-1}$.

We now consider the corresponding rank consistency results, when $\lambda_n$ goes to zero
faster than $n^{-1/2}$. The following proposition (proved in Appendix C.3) states that for such
regularization parameter, the solution has rank strictly greater than $r$ with probability
tending to one and can thus not be rank consistent:

Proposition 7 Assume (A1), (A2) and (A3). If $n^{1/2} \lambda_n$ tends to zero, then $P(\mathrm{rank}(\hat{W}) >
\mathrm{rank}(\mathbf{W}))$ tends to one.

4.2 $n^{-1/2}$-decay of the regularization parameter

We first consider regular consistency through the following proposition (proved in Appendix C.4),
then rank consistency (proposition proved in Appendix C.5):

Proposition 8 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If
$n^{1/2} \lambda_n$ tends to a limit $\lambda_0 > 0$, then $n^{1/2} (\hat{W} - \mathbf{W})$ converges in distribution to the unique
global minimizer of

$$\min_{\Delta \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm}\, \mathrm{vec}(\Delta) - \mathrm{tr}\, \Delta^\top A + \lambda_0 \left[ \mathrm{tr}\, \mathbf{U}^\top \Delta \mathbf{V} + \| \mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp \|_* \right],$$

where $\mathrm{vec}(A) \in \mathbb{R}^{pq}$ is normally distributed with mean zero and covariance matrix $\sigma^2 \Sigma_{mm}$.

Proposition 9 Assume (A1), (A2) and (A3). If $n^{1/2} \lambda_n$ tends to a limit $\lambda_0 > 0$, then
the probability that the rank of $\hat{W}$ is equal to the rank of $\mathbf{W}$ is converging to
$P(\|\Lambda - \lambda_0^{-1} \Theta\|_2 \leq 1) \in (0, 1)$, where $\Lambda \in \mathbb{R}^{(p-r) \times (q-r)}$ is defined in Eq. (6) (Section 4.3) and
$\Theta \in \mathbb{R}^{(p-r) \times (q-r)}$ has a normal distribution with mean zero and covariance matrix

$$\sigma^2 \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \right)^{-1}.$$



The previous proposition ensures that the estimate cannot be rank consistent with this
decay of the regularization parameter. Note that when we take $\lambda_0$ small (i.e., we get closer
to fast decays), the probability $P(\|\Lambda - \lambda_0^{-1} \Theta\|_2 \leq 1)$ tends to zero, while when we take
$\lambda_0$ large (i.e., we get closer to slow decays), the same probability tends to zero or one
depending on the sign of $\|\Lambda\|_2 - 1$. This heuristic argument is made more precise in the
following section.

4.3 Slow decay of regularization parameter 

When $\lambda_n$ tends to zero more slowly than $n^{-1/2}$, the first order expansion is deterministic,
as the following proposition shows (proof in Appendix C.6):

Proposition 10 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1).
If $n^{1/2} \lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then $\lambda_n^{-1} (\hat{W} - \mathbf{W})$ converges in probability to
the unique global minimizer $\Delta$ of

$$\min_{\Delta \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm}\, \mathrm{vec}(\Delta) + \mathrm{tr}\, \mathbf{U}^\top \Delta \mathbf{V} + \| \mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp \|_*. \qquad (5)$$

Moreover, we have $\hat{W} = \mathbf{W} + \lambda_n \Delta + O_p(\lambda_n + \zeta_n + \lambda_n^{-1} n^{-1/2})$.

The last proposition gives a first order expansion of $\hat{W}$ around $\mathbf{W}$. From Proposition 18
(Appendix B), we obtain immediately that if $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp$ is different from zero, then the rank
of $\hat{W}$ is ultimately strictly larger than $r$. The condition $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$ is thus necessary for
rank consistency when $\lambda_n n^{1/2}$ tends to infinity while $\lambda_n$ tends to zero. The next lemma
(proved in Appendix 11) gives a necessary and sufficient condition for $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$.

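Lemma 11 Assume $\Sigma_{mm}$ is invertible, and $\mathbf{W} = \mathbf{U}\, \mathrm{Diag}(\mathbf{s})\, \mathbf{V}^\top$ is the singular value decomposition
of $\mathbf{W}$. Then the unique global minimizer of

$$\frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm}\, \mathrm{vec}(\Delta) + \mathrm{tr}\, \mathbf{U}^\top \Delta \mathbf{V} + \| \mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp \|_*$$

satisfies $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$ if and only if

$$\left\| \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \right)^{-1} \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V} \otimes \mathbf{U})\, \mathrm{vec}(I) \right) \right\|_2 \leq 1.$$

This leads us to consider the matrix $\Lambda \in \mathbb{R}^{(p-r) \times (q-r)}$ defined as

$$\mathrm{vec}(\Lambda) = \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \right)^{-1} \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V} \otimes \mathbf{U})\, \mathrm{vec}(I) \right), \qquad (6)$$

and the two weak and strict consistency conditions:

$$\|\Lambda\|_2 \leq 1, \qquad (7)$$

$$\|\Lambda\|_2 < 1. \qquad (8)$$

Note that if $\Sigma_{mm}$ is proportional to identity, they are always satisfied because then $\Lambda = 0$.
We can now prove that the condition in Eq. (8) is sufficient for rank consistency when $n^{1/2} \lambda_n$
tends to infinity, while the condition Eq. (7) is necessary for the existence of a sequence $\lambda_n$
such that the estimate is both consistent and rank consistent (which is a stronger result
than restricting $\lambda_n$ to be tending to zero slower than $n^{-1/2}$). The following two theorems
are proved in Appendix C.8 and C.9: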

Theorem 12 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If the
condition in Eq. (8) is satisfied, and if $n^{1/2} \lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then the
estimate $\hat{W}$ is consistent and rank-consistent.

Theorem 13 Assume (A1), (A2) and (A3). Let $\hat{W}$ be a global minimizer of Eq. (1). If
the estimate $\hat{W}$ is consistent and rank-consistent, then the condition in Eq. (7) is satisfied.

As opposed to the Lasso, where Eq. (7) is a necessary and sufficient condition for rank 
consistency (Yuan and Lin, 2007), this is not even true in general for the group Lasso (Bach, 
2007). Looking at the limiting case $\|\Lambda\|_2 = 1$ would similarly lead to additional but more
complex sufficient and necessary conditions, and is left out for future research. 
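The matrix $\Lambda$ of Eq. (6) and the conditions in Eq. (7) and Eq. (8) can be evaluated numerically when $\Sigma_{mm}$ and the low-rank matrix are known. The following is a minimal sketch; the function name `consistency_matrix` and the random test matrices are assumptions made for illustration only.

```python
import numpy as np

def consistency_matrix(Sigma_mm, W, r):
    """Return Lambda from Eq. (6) for a p x q matrix W of rank r."""
    p, q = W.shape
    U_full, _, Vt_full = np.linalg.svd(W, full_matrices=True)
    U, U_perp = U_full[:, :r], U_full[:, r:]
    V, V_perp = Vt_full[:r].T, Vt_full[r:].T
    K_perp = np.kron(V_perp, U_perp)          # V_perp kron U_perp
    K = np.kron(V, U)                         # V kron U
    Sigma_inv = np.linalg.inv(Sigma_mm)
    lhs = K_perp.T @ Sigma_inv @ K_perp
    rhs = K_perp.T @ Sigma_inv @ K @ np.eye(r).reshape(-1, order="F")
    vec_lambda = np.linalg.solve(lhs, rhs)
    return vec_lambda.reshape((p - r, q - r), order="F")

rng = np.random.default_rng(0)
p, q, r = 4, 4, 2
W = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # rank-r matrix

# Isotropic second-order moments: Lambda vanishes, so Eq. (7) and (8) hold.
Lambda_iso = consistency_matrix(3.0 * np.eye(p * q), W, r)

# A generic positive definite Sigma_mm: the conditions may or may not hold.
G = rng.standard_normal((p * q, p * q))
Lambda_generic = consistency_matrix(G @ G.T + np.eye(p * q), W, r)
strict_condition_holds = np.linalg.norm(Lambda_generic, 2) < 1.0
```

The orthogonal complements returned by the SVD are only defined up to rotation, but the spectral norm of $\Lambda$ does not depend on that choice.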

Moreover, it may seem surprising that even when the sufficient condition Eq. (8) is
fulfilled, the first order expansion of $\hat{W}$, i.e., $\hat{W} = \mathbf{W} + \lambda_n \Delta + o_p(\lambda_n)$, is such that
$\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$, but nothing is said about $\mathbf{U}_\perp^\top \Delta \mathbf{V}$ and $\mathbf{U}^\top \Delta \mathbf{V}_\perp$, which are not equal to zero
in general. This is due to the fact that the first $r$ singular vectors $U$ and $V$ of $\mathbf{W} + \lambda_n \Delta$
are not fixed; indeed, the $r$ first singular vectors (i.e., the implicit features) do rotate but
with no contribution on $\mathbf{U}_\perp \mathbf{V}_\perp^\top$. This is to be contrasted with the adaptive version where
asymptotically the first order expansion has constant singular vectors (see Section 5).

Finally, in this paper, we have only proved whether the probability of correct rank
selection tends to zero or one. Proposition 9 suggests that when $\lambda_n n^{1/2}$ tends to infinity
slowly, then this probability is close to $P(\|\Lambda - \lambda_n^{-1} n^{-1/2} \Theta\|_2 \leq 1)$, where $\Theta$ has a normal
distribution with known covariance matrix, which converges to one exponentially fast when
$\|\Lambda\|_2 < 1$. We are currently investigating additional assumptions under which such results
are true and thus estimate the convergence rates of the probability of good rank selection
as done by Zhao and Yu (2006) for the Lasso.

4.4 Factored second order moment 

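Note that in the situation where $n_x$ points in $\mathbb{R}^p$ and $n_y$ points in $\mathbb{R}^q$ are sampled i.i.d. and a
random subset of $n$ points is selected, we can refine the condition as follows (because
$\Sigma_{mm} = \Sigma_{yy} \otimes \Sigma_{xx}$):

$$\Lambda = (\mathbf{U}_\perp^\top \Sigma_{xx}^{-1} \mathbf{U}_\perp)^{-1}\, \mathbf{U}_\perp^\top \Sigma_{xx}^{-1} \mathbf{U}\, \mathbf{V}^\top \Sigma_{yy}^{-1} \mathbf{V}_\perp\, (\mathbf{V}_\perp^\top \Sigma_{yy}^{-1} \mathbf{V}_\perp)^{-1},$$

which is equal to (by the expression of inverses of partitioned matrices):

$$\Lambda = (\mathbf{U}_\perp^\top \Sigma_{xx} \mathbf{U}) (\mathbf{U}^\top \Sigma_{xx} \mathbf{U})^{-1} (\mathbf{V}^\top \Sigma_{yy} \mathbf{V})^{-1} (\mathbf{V}^\top \Sigma_{yy} \mathbf{V}_\perp).$$

This also happens when $M_i = x_i y_i^\top$ and $x_i$ and $y_i$ are independent for all $i$.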
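The agreement between the general definition in Eq. (6) and the two factored expressions can be verified numerically. The sketch below uses randomly generated positive definite matrices as stand-ins for $\Sigma_{xx}$ and $\Sigma_{yy}$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 4, 3, 2

def random_spd(d):
    """Random symmetric positive definite matrix of size d x d."""
    G = rng.standard_normal((d, d))
    return G @ G.T + np.eye(d)

Sxx, Syy = random_spd(p), random_spd(q)
W = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
U_full, _, Vt_full = np.linalg.svd(W, full_matrices=True)
U, Up = U_full[:, :r], U_full[:, r:]
V, Vp = Vt_full[:r].T, Vt_full[r:].T

# General definition, Eq. (6), with Sigma_mm = Syy kron Sxx.
S_inv = np.linalg.inv(np.kron(Syy, Sxx))
Kp, K = np.kron(Vp, Up), np.kron(V, U)
vec_lam = np.linalg.solve(Kp.T @ S_inv @ Kp,
                          Kp.T @ S_inv @ K @ np.eye(r).reshape(-1, order="F"))
Lam_general = vec_lam.reshape((p - r, q - r), order="F")

# First factored expression (inverses of Sxx and Syy).
Sxi, Syi = np.linalg.inv(Sxx), np.linalg.inv(Syy)
Lam_inverse = (np.linalg.inv(Up.T @ Sxi @ Up) @ Up.T @ Sxi @ U
               @ V.T @ Syi @ Vp @ np.linalg.inv(Vp.T @ Syi @ Vp))

# Second factored expression (partitioned-inverse form).
Lam_partitioned = ((Up.T @ Sxx @ U) @ np.linalg.inv(U.T @ Sxx @ U)
                   @ np.linalg.inv(V.T @ Syy @ V) @ (V.T @ Syy @ Vp))
```

The partitioned-inverse step introduces one sign change on the $x$ side and one on the $y$ side, which cancel in the product.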
4.5 Corollaries for the Lasso and group Lasso

For the Lasso or the group Lasso, all proposed results in Section 4.3 should hold with the
additional conditions that $W$ and $\Delta$ are diagonal (block-diagonal for the group Lasso). In
this situation, the singular values of the diagonal matrix $W = \mathrm{Diag}(w)$ are the norms of
the diagonal blocks, while the left singular vectors are equal to the normalized versions
of the block (the signs for the Lasso). However, the results developed in Section 4.3 do
not immediately apply since the assumptions regarding the invertibility of the second order
moment matrix are not satisfied. For those problems, all matrices $M$ that are ever considered
belong to a strict subspace of $\mathbb{R}^{p \times q}$ and we need to satisfy invertibility on that subspace.

More precisely, we assume that all matrices $M$ are such that $\mathrm{vec}(M) = Hx$ where $H$ is a
given design matrix in $\mathbb{R}^{pq \times s}$, where $s$ is the number of implicit parameters and $x \in \mathbb{R}^s$. If we
replace the invertibility of $\Sigma_{mm}$ by the invertibility of $H^\top \Sigma_{mm} H$, then all results presented
in Section 4.3 are valid; in particular, the matrix $\Lambda$ may be written as

$$\begin{aligned}
\mathrm{vec}(\Lambda) = {} & \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top H (H^\top \Sigma_{mm} H)^{-1} H^\top (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \right)^\dagger \\
& \times \left( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top H (H^\top \Sigma_{mm} H)^{-1} H^\top (\mathbf{V} \otimes \mathbf{U})\, \mathrm{vec}(I) \right), \qquad (9)
\end{aligned}$$

where $A^\dagger$ denotes the pseudo-inverse of $A$ (Golub and Loan, 1996).

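We now apply Eq. (9) to the case of the group Lasso (which includes the Lasso as
a special case). In this situation, we have $M = \mathrm{Diag}(x_1, \dots, x_m)$ and each $x_j \in \mathbb{R}^{d_j}$,
$j = 1, \dots, m$; we consider $w$ as being defined by blocks $w_1, \dots, w_m$, where each $w_j \in \mathbb{R}^{d_j}$.
The design matrix $H$ is such that $Hw = \mathrm{vec}(\mathrm{Diag}(w))$ and the matrix $H^\top \Sigma_{mm} H$ is exactly
equal to the joint covariance matrix $\Sigma_{xx}$ of $x = (x_1, \dots, x_m)$. Without loss of generality, we
assume that the generating sparsity pattern corresponds to the first $r$ blocks. We can then
compute the singular value decomposition in closed form as

$$\mathbf{U} = \begin{pmatrix} \mathrm{Diag}(w_i / \|w_i\|)_{i \leq r} \\ 0 \end{pmatrix}, \quad \mathbf{V} = \begin{pmatrix} I \\ 0 \end{pmatrix}, \quad \mathbf{s} = (\|w_j\|)_{j \leq r}.$$

If we denote, for each $j$, by $O_j$ a basis of the subspace orthogonal to $w_j$, we have

$$\mathbf{U}_\perp = \begin{pmatrix} \mathrm{Diag}(O_i)_{i \leq r} & 0 \\ 0 & I \end{pmatrix}, \quad \mathbf{V}_\perp = \begin{pmatrix} 0 \\ I \end{pmatrix}.$$

We can put these singular vectors into Eq. (9) and get $(H^\top \Sigma_{mm} H)^{-1} H^\top (\mathbf{V} \otimes \mathbf{U})\, \mathrm{vec}(I) = (\Sigma_{xx}^{-1})_{:, J}\, \eta_J$, where $J = \{1, \dots, r\}$
and $\eta_J$ is the vector of normalized $w_j$, $j \in J$. Thus, for the group Lasso, we finally obtain:

$$\begin{aligned}
\|\Lambda\|_2 &= \left\| \mathrm{Diag}\left[ \left( (\Sigma_{xx}^{-1})_{J^c J^c} \right)^{-1} (\Sigma_{xx}^{-1})_{J^c J}\, \eta_J \right] \right\|_2 \\
&= \left\| \mathrm{Diag}\left[ (\Sigma_{xx})_{J^c J}\, (\Sigma_{xx})_{JJ}^{-1}\, \eta_J \right] \right\|_2 \quad \text{by the partitioned matrices inversion lemma,} \\
&= \max_{i \in J^c} \left\| \Sigma_{x_i x_J} \Sigma_{x_J x_J}^{-1}\, \eta_J \right\|.
\end{aligned}$$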

The condition on the invertibility of $H^\top \Sigma_{mm} H$ is exactly the invertibility of the full joint
covariance matrix of $x = (x_1, \dots, x_m)$ and is a standard assumption for the Lasso or the
group Lasso (Yuan and Lin, 2007, Zhao and Yu, 2006, Zou, 2006, Bach, 2007). Moreover,
the condition $\|\Lambda\|_2 \leq 1$ is exactly the one for the group Lasso (Bach, 2007), where the
pattern consistency is replaced by the consistency for the number of non zero groups.

Note that we only obtain a result in terms of numbers of selected groups of variables 
and not in terms of the identities of the groups themselves. However, because of regular 
consistency, we know that at least the r true groups will be selected, and then correct model 
size is asymptotically equivalent to the correct groups being selected. 
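The final group Lasso expression, $\max_{i \in J^c} \| \Sigma_{x_i x_J} \Sigma_{x_J x_J}^{-1} \eta_J \|$, is simple to evaluate from a covariance matrix and a generating vector. A minimal sketch follows; the function name `group_lasso_condition_value` and the example covariance matrices are hypothetical, and the Lasso is treated as the case of singleton groups.

```python
import numpy as np

def group_lasso_condition_value(Sigma_xx, w, groups):
    """max over inactive groups i of || Sigma_{x_i x_J} Sigma_{x_J x_J}^{-1} eta_J ||."""
    active = [g for g in groups if np.linalg.norm(w[g]) > 0]
    inactive = [g for g in groups if np.linalg.norm(w[g]) == 0]
    J = np.concatenate(active)
    eta_J = np.concatenate([w[g] / np.linalg.norm(w[g]) for g in active])
    coef = np.linalg.solve(Sigma_xx[np.ix_(J, J)], eta_J)
    return max(np.linalg.norm(Sigma_xx[np.ix_(g, J)] @ coef) for g in inactive)

# Lasso as the special case of singleton groups.
groups = [np.array([j]) for j in range(4)]
w = np.array([1.0, -2.0, 0.0, 0.0])

value_identity = group_lasso_condition_value(np.eye(4), w, groups)

# Strong correlation between an active and an inactive variable.
Sigma = np.eye(4)
Sigma[0, 2] = Sigma[2, 0] = 0.9
value_correlated = group_lasso_condition_value(Sigma, w, groups)
```

With uncorrelated variables the value is zero, so both conditions hold; correlation between active and inactive variables pushes the value toward one. The sketch assumes at least one inactive group.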
5. Adaptive version

We can follow the adaptive version of the Lasso to provide a consistent algorithm with no
consistency conditions such as Eq. (7) or Eq. (8). More precisely, we consider the least-square
estimate $\mathrm{vec}(\hat{W}_{LS}) = \hat{\Sigma}_{mm}^{-1} \mathrm{vec}(\hat{\Sigma}_{Mz})$. We have the following well known result for
least-square regression:

Lemma 14 Assume (A1), (A2) and (A3). Then $n^{1/2} (\hat{\Sigma}_{mm}^{-1} \mathrm{vec}(\hat{\Sigma}_{Mz}) - \mathrm{vec}(\mathbf{W}))$ is converging
in distribution to a normal distribution with zero mean and covariance matrix
$\sigma^2 \Sigma_{mm}^{-1}$.

We consider the singular value decomposition of $\hat{W}_{LS} = U_{LS}\, \mathrm{Diag}(s_{LS})\, V_{LS}^\top$, where $s_{LS} \geq 0$.
With probability tending to one, $\min\{p, q\}$ singular values are strictly positive (i.e., the
rank of $\hat{W}_{LS}$ is full). We consider the full decomposition where $U_{LS}$ and $V_{LS}$ are orthogonal
square matrices and the matrix $\mathrm{Diag}(s_{LS})$ is rectangular. We complete the singular values
$s_{LS} \in \mathbb{R}^{\min\{p, q\}}$ by $n^{-1/2}$ to reach dimensions $p$ and $q$ (we keep the same notation for both
dimensions for simplicity).

For $\gamma \in (0, 1]$, we denote

$$A = U_{LS}\, \mathrm{Diag}(s_{LS})^{-\gamma}\, U_{LS}^\top \in \mathbb{R}^{p \times p} \quad \text{and} \quad B = V_{LS}\, \mathrm{Diag}(s_{LS})^{-\gamma}\, V_{LS}^\top \in \mathbb{R}^{q \times q},$$

two positive definite symmetric matrices, and, following the adaptive Lasso of Zou (2006),
we consider replacing $\|W\|_*$ by $\|AWB\|_*$; note that in the Lasso special case, this exactly
corresponds to the adaptive Lasso of Zou (2006). We obtain the following consistency
theorem (proved in Appendix C.10):

Theorem 15 Assume (A1), (A2) and (A3). If $\gamma \in (0, 1]$, $n^{1/2} \lambda_n$ tends to $0$ and $\lambda_n n^{1/2 + \gamma/2}$
tends to infinity, then any global minimizer $\hat{W}_A$ of

$$\frac{1}{2n} \sum_{i=1}^n (z_i - \mathrm{tr}\, W^\top M_i)^2 + \lambda_n \|AWB\|_*$$

is consistent and rank consistent. Moreover, $n^{1/2}\, \mathrm{vec}(\hat{W}_A - \mathbf{W})$ is converging in distribution
to a normal distribution with mean zero and covariance matrix

$$\sigma^2 (\mathbf{V} \otimes \mathbf{U}) \left[ (\mathbf{V} \otimes \mathbf{U})^\top \Sigma_{mm} (\mathbf{V} \otimes \mathbf{U}) \right]^{-1} (\mathbf{V} \otimes \mathbf{U})^\top.$$

Note the restriction $\gamma \leq 1$, which is due to the fact that the least-square estimate $\hat{W}_{LS}$
only estimates the singular subspaces at rate $O_p(n^{-1/2})$. In Section 6.3, we illustrate the
previous theorem on synthetic examples. In particular, we exhibit some singular behavior
for the limiting case $\gamma = 1$.
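The construction of the weighting matrices $A$ and $B$ from the least-square estimate can be sketched as follows. This is an illustrative reading of the definitions above, not the author's code; the function names and the random stand-in for $\hat{W}_{LS}$ are assumptions, and the least-square estimate is assumed to have full rank.

```python
import numpy as np

def adaptive_weights(W_ls, n, gamma):
    """Build A (p x p) and B (q x q) from the least-square estimate W_ls."""
    p, q = W_ls.shape
    U_ls, s_ls, Vt_ls = np.linalg.svd(W_ls, full_matrices=True)
    # Complete the min(p, q) singular values with n^{-1/2} up to length p and q.
    s_p = np.concatenate([s_ls, np.full(p - len(s_ls), n ** -0.5)])
    s_q = np.concatenate([s_ls, np.full(q - len(s_ls), n ** -0.5)])
    A = U_ls @ np.diag(s_p ** -gamma) @ U_ls.T
    B = Vt_ls.T @ np.diag(s_q ** -gamma) @ Vt_ls
    return A, B

def adaptive_penalty(W, A, B):
    """Trace norm of A W B, the penalty that replaces ||W||_*."""
    return np.linalg.svd(A @ W @ B, compute_uv=False).sum()

rng = np.random.default_rng(0)
W_ls = rng.standard_normal((4, 3))     # stand-in for a least-square estimate
A, B = adaptive_weights(W_ls, n=1000, gamma=0.5)
penalty_value = adaptive_penalty(W_ls, A, B)
```

Directions in which the least-square estimate has small singular values receive large weights, so they are penalized more heavily, in the same spirit as the adaptive Lasso.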

6. Algorithms and simulations 

In this section we provide a simple algorithm to solve problems of the form

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W)^\top \Sigma\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top Q + \lambda \|W\|_*, \qquad (10)$$

where $\Sigma \in \mathbb{R}^{pq \times pq}$ is a positive definite matrix (note that we do not restrict $\Sigma$ to be of the
form $\Sigma = A \otimes B$ where $A$ and $B$ are positive semidefinite matrices of size $p \times p$ and $q \times q$).
We assume that $\mathrm{vec}(Q)$ is in the column space of $\Sigma$, so that the optimization problem is
bounded from below (and thus the dual is feasible). In our setting, we have $\Sigma = \hat{\Sigma}_{mm}$ and
$Q = \hat{\Sigma}_{Mz}$.

We focus on problems where p and q are not too large so that we can apply Newton's 
method to obtain convergence up to machine precision, which is required for the fine analysis 
of rank consistency in Section 6.3. For more efficient algorithms with larger p and q, 
see Srebro et al. (2005), Rennie and Srebro (2005) and Abernethy et al. (2006). 

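Because the dual norm of the trace norm is the spectral norm (see Appendix B), the
dual is easily obtained as

$$\max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; -\frac{1}{2} \mathrm{vec}(Q - \lambda V)^\top \Sigma^{-1} \mathrm{vec}(Q - \lambda V). \qquad (11)$$

Indeed, we have:

$$\begin{aligned}
& \min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W)^\top \Sigma\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top Q + \lambda \|W\|_* \\
&= \min_{W \in \mathbb{R}^{p \times q}} \; \max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; \frac{1}{2} \mathrm{vec}(W)^\top \Sigma\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top Q + \lambda\, \mathrm{tr}\, V^\top W \\
&= \max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; \min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W)^\top \Sigma\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top Q + \lambda\, \mathrm{tr}\, V^\top W \\
&= \max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; -\frac{1}{2} \mathrm{vec}(Q - \lambda V)^\top \Sigma^{-1} \mathrm{vec}(Q - \lambda V),
\end{aligned}$$

where strong duality holds because both the primal and dual problems are convex and
strictly feasible (Boyd and Vandenberghe, 2003).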
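The last equality in this derivation is an unconstrained quadratic minimization in $W$. As a short worked step (the shorthand $c$ is introduced here only for illustration), write $c = \mathrm{vec}(Q - \lambda V)$ and $w = \mathrm{vec}(W)$; since $\Sigma$ is positive definite, setting the gradient to zero gives the minimizer and the optimal value:

```latex
\min_{w \in \mathbb{R}^{pq}} \; \frac{1}{2} w^\top \Sigma w - w^\top c
\quad\Longrightarrow\quad
\Sigma w - c = 0
\quad\Longrightarrow\quad
w^\star = \Sigma^{-1} c,
\qquad
\frac{1}{2} (w^\star)^\top \Sigma w^\star - (w^\star)^\top c
= \frac{1}{2} c^\top \Sigma^{-1} c - c^\top \Sigma^{-1} c
= -\frac{1}{2} c^\top \Sigma^{-1} c .
```

In particular, the primal solution is recovered from a dual solution $V$ through $\mathrm{vec}(W) = \Sigma^{-1} \mathrm{vec}(Q - \lambda V)$.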

6.1 Smoothing 

The problem in Eq. (10) is convex but non differentiable; in this paper we consider adding
a strictly convex function to its dual in Eq. (11) in order to make it differentiable, while
controlling the increase of duality gap yielded by the added function (Bonnans et al., 2003).
We thus consider the following smoothing of the trace norm, namely we define

$$F_\varepsilon(W) = \max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; \mathrm{tr}\, V^\top W - \varepsilon B(V),$$

where $B(V)$ is a spectral function (i.e., a function that depends only on the singular values of $V$), equal to
$B(V) = \sum_{i=1}^{\min\{p, q\}} b(s_i(V))$ where $b(s) = (1 + s) \log(1 + s) + (1 - s) \log(1 - s)$ if $|s| \leq 1$ and
$+\infty$ otherwise ($s_i(V)$ denotes the $i$-th largest singular value of $V$). This function $F_\varepsilon$ may
be computed in closed form as:

$$F_\varepsilon(W) = \sum_{i=1}^{\min\{p, q\}} b^*(s_i(W)),$$

where $b^*(s) = \varepsilon \log(1 + e^{s/\varepsilon}) + \varepsilon \log(1 + e^{-s/\varepsilon}) - 2 \varepsilon \log 2$. These functions are plotted in
Figure 1; note that $|b^*(s) - |s||$ is uniformly bounded by $2 \varepsilon \log 2$.
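The closed form of the smoothed trace norm is straightforward to evaluate from the singular values. The following minimal sketch (function names are illustrative) computes $F_\varepsilon$ and checks the gap to the exact trace norm, which lies between $0$ and $2 \varepsilon \log 2 \times \min\{p, q\}$ since $b^*(s) = |s| + 2\varepsilon \log(1 + e^{-|s|/\varepsilon}) - 2\varepsilon \log 2$:

```python
import numpy as np

def b_star(s, eps):
    """Smoothed absolute value: eps*log(1+e^{s/eps}) + eps*log(1+e^{-s/eps}) - 2*eps*log(2)."""
    # np.logaddexp(0, t) evaluates log(1 + e^t) without overflow.
    return eps * (np.logaddexp(0.0, s / eps) + np.logaddexp(0.0, -s / eps)) - 2 * eps * np.log(2)

def smoothed_trace_norm(W, eps):
    """F_eps(W): sum of b_star over the singular values of W."""
    return b_star(np.linalg.svd(W, compute_uv=False), eps).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
eps = 1e-2
trace_norm = np.linalg.svd(W, compute_uv=False).sum()
gap = trace_norm - smoothed_trace_norm(W, eps)   # lies in [0, 2*eps*log(2)*min(p, q)]
```

Using `np.logaddexp` avoids overflow in $e^{s/\varepsilon}$ when $\varepsilon$ becomes very small, which matters once $\varepsilon$ is decreased toward machine precision.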
Figure 1: Spectral barrier functions: (left) primal function $b(s)$ and (right) dual functions
$b^*(s)$.



We finally get the following pairs of primal/dual optimization problems:

$$\min_{W \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{vec}(W)^\top \Sigma\, \mathrm{vec}(W) - \mathrm{tr}\, W^\top Q + \lambda F_{\varepsilon / \lambda}(W),$$

$$\max_{V \in \mathbb{R}^{p \times q},\, \|V\|_2 \leq 1} \; -\frac{1}{2} \mathrm{vec}(Q - \lambda V)^\top \Sigma^{-1} \mathrm{vec}(Q - \lambda V) - \varepsilon B(V).$$

We can now optimize directly in the primal formulation which is infinitely differentiable,
using Newton's method. Note that the stopping criterion should be an $\varepsilon \times \min\{p, q\}$ duality
gap, as the controlled smoothing also leads to a small additional gap on the solution of the
original non smoothed problem. More precisely, a duality gap of $\varepsilon \times \min\{p, q\}$ on the
smoothed problem leads to a gap of at most $(1 + 2 \log 2)\, \varepsilon \times \min\{p, q\}$ for the original
problem.

6.2 Implementation details

Derivatives of spectral functions Note that derivatives of spectral functions of the
form $B(W) = \sum_{i=1}^{\min\{p, q\}} b(s_i(W))$, where $b$ is an even twice differentiable function such that
$b(0) = b'(0) = 0$, are easily calculated as follows. Let $U\, \mathrm{Diag}(s)\, V^\top$ be the singular value
decomposition of $W$. We then have the following Taylor expansion (Lewis and Sendov,
2002):

$$B(W + \Delta) = B(W) + \mathrm{tr}\, \Delta^\top U\, \mathrm{Diag}(b'(s_i))\, V^\top + \frac{1}{2} \sum_{i=1}^p \sum_{j=1}^q \frac{b'(s_i) - b'(s_j)}{s_i - s_j} (u_i^\top \Delta v_j)^2,$$

where the vector of singular values is completed by zeros, and $\frac{b'(s_i) - b'(s_j)}{s_i - s_j}$ is defined as $b''(s_i)$
when $s_i = s_j$.
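The first-order term of this expansion, i.e., the gradient $U\, \mathrm{Diag}(b'(s_i))\, V^\top$, can be checked against finite differences. The sketch below takes for $b$ the smooth even function $b^*$ of Section 6.1, whose derivative is $\tanh(s / (2\varepsilon))$; the function names, sizes and step size are illustrative assumptions.

```python
import numpy as np

def B(W, eps):
    """Spectral function sum_i b(s_i(W)) with the smooth even b(s) = b_star(s)."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(eps * (np.logaddexp(0.0, s / eps) + np.logaddexp(0.0, -s / eps))
                  - 2 * eps * np.log(2))

def grad_B(W, eps):
    """First-order term of the expansion: U Diag(b'(s_i)) V^T, here b'(s) = tanh(s / (2 eps))."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.tanh(s / (2 * eps))) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
Delta = rng.standard_normal((4, 3))
eps, h = 0.5, 1e-6

# Central finite difference of t -> B(W + t * Delta) at t = 0.
numeric = (B(W + h * Delta, eps) - B(W - h * Delta, eps)) / (2 * h)
analytic = np.sum(Delta * grad_B(W, eps))        # tr(Delta^T grad)
```

Only the gradient is verified here; the second-order term would be needed to assemble the Hessian used by Newton's method.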

Choice of $\varepsilon$ and computational complexity Following the common practice in barrier
methods we decrease the parameter geometrically after each iteration of Newton's
method (Boyd and Vandenberghe, 2003). Each of these Newton iterations has complexity
$O(p^3 q^3)$. Empirically, the number of iterations does not exceed a few hundreds for solving
one problem up to machine precision$^1$. We are currently investigating theoretical bounds on
the number of iterations through self concordance theory (Boyd and Vandenberghe, 2003).

Start and end of the path  To avoid considering useless values of the regularization
parameter, and thus to use a well adapted grid for trying several $\lambda$'s, we can consider a
specific interval for $\lambda$. When $\lambda$ is large, the solution is exactly zero, while when $\lambda$ is small,
the solution tends to $\mathrm{vec}(W) = \Sigma^{-1} \mathrm{vec}(Q)$.

More precisely, if $\lambda$ is larger than $\|Q\|_2$, then the solution is exactly zero (because
in this situation $0$ is in the subdifferential). On the other side, we consider for which
$\lambda$ the candidate $\Sigma^{-1} \mathrm{vec}(Q)$ leads to a duality gap which is less than $\varepsilon\, \mathrm{vec}(Q)^\top \Sigma^{-1} \mathrm{vec}(Q)$, where $\varepsilon$ is
small. A looser condition is to take $V = 0$, and the condition becomes $\lambda \|\Sigma^{-1} \mathrm{vec}(Q)\|_* \leqslant
\varepsilon\, \mathrm{vec}(Q)^\top \Sigma^{-1} \mathrm{vec}(Q)$. Note that this is in the correct order (i.e., lower bound smaller than
upper bound), because

$$\mathrm{vec}(Q)^\top \Sigma^{-1} \mathrm{vec}(Q) = \langle \mathrm{vec}(Q), \Sigma^{-1} \mathrm{vec}(Q) \rangle \leqslant \|\Sigma^{-1} \mathrm{vec}(Q)\|_* \, \|\mathrm{vec}(Q)\|_2.$$

This allows us to design a good interval for searching for a good value of $\lambda$ or for computing
the regularization path by uniform grid sampling (in log scale), or numerical path following
with predictor-corrector methods such as used by Bach et al. (2004).
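The two end points described above can be turned into a log-scale grid. The following is a sketch only: the function name `lambda_grid`, the default values of `eps` and `num`, and the column-stacking convention for $\mathrm{vec}$ are assumptions made for this example, and $\Sigma$ is assumed symmetric positive definite.

```python
import numpy as np

def lambda_grid(Sigma, Q, eps=1e-3, num=20):
    """Log-spaced grid between the small-lambda and large-lambda end points.

    Sigma : (p*q, p*q) symmetric positive definite matrix
    Q     : (p, q) matrix
    """
    p, q = Q.shape
    vec_q = Q.reshape(-1, order="F")          # column-stacking vec(Q)
    w = np.linalg.solve(Sigma, vec_q)         # Sigma^{-1} vec(Q)
    W_unreg = w.reshape(p, q, order="F")      # unregularized solution
    lam_max = np.linalg.norm(Q, 2)            # spectral norm: solution is zero above it
    # looser condition with V = 0: lambda * ||W_unreg||_* <= eps * vec(Q)^T Sigma^{-1} vec(Q)
    lam_min = eps * (vec_q @ w) / np.linalg.norm(W_unreg, "nuc")
    return np.logspace(np.log10(lam_min), np.log10(lam_max), num)
```

By the duality inequality displayed above, the lower end point is at most `eps` times the upper one, so the grid is never empty when `eps < 1`.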

6.3 Simulations 

In this section, we perform simulations on toy examples to illustrate our consistency results.
We generate random i.i.d. data $X$ and $Y$ with Gaussian distributions and we select a low
rank matrix $\mathbf{W}$ at random and generate $Z = \mathrm{diag}(X^\top \mathbf{W} Y) + \varepsilon$, where $\varepsilon$ has i.i.d. components
with normal distributions with zero mean and known variance. In this section, we always use
$r = 2$, $p = q = 4$, while we consider several numbers of samples $n$, and several distributions
for which the consistency conditions Eq. (7) and Eq. (8) may or may not be satisfied (see footnote 2).
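A data-generation step of this kind might look as follows. This is a sketch under assumptions, not the authors' MATLAB code: the function name `simulate_toy_data`, the noise level, the seed, and the identity covariance of the Gaussian inputs are illustrative, and samples are stored one per row, so that the response is $z_i = x_i^\top \mathbf{W} y_i + \varepsilon_i$.

```python
import numpy as np

def simulate_toy_data(n, p=4, q=4, r=2, noise_std=0.1, seed=0):
    """Draw n samples (x_i, y_i, z_i) with z_i = x_i^T W y_i + noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # one sample x_i per row (layout assumption)
    Y = rng.standard_normal((n, q))   # one sample y_i per row
    # product of two Gaussian factors: rank r with probability one
    W = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
    noise = noise_std * rng.standard_normal(n)
    z = np.einsum("ip,pq,iq->i", X, W, Y) + noise
    return X, Y, W, z
```

Other input distributions (for instance with correlated coordinates) would be needed to obtain cases where the consistency conditions fail.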

In Figure 2, we plot regularization paths for $n = 10^3$, by showing the singular values of
$\hat{W}$ compared to the singular values of $\mathbf{W}$, in two particular situations (Eq. (7) and Eq. (8)
satisfied and not satisfied), for the regular trace norm regularization and the adaptive
versions, with $\gamma = 1/2$ and $\gamma = 1$. Note that in the consistent case (top), the singular
values and their cardinalities are well jointly estimated, both for the non adaptive version (as
predicted by Theorem 12) and the adaptive versions (Theorem 15), while the range of correct
rank selection increases compared to the adaptive versions. However, in the inconsistent case,
the non adaptive regularization scheme (bottom left) cannot achieve regular consistency
together with rank consistency (Theorem 13), while the adaptive schemes can. Note the
particular behavior of the limiting case $\gamma = 1$, which still achieves both consistencies but
with a singular behavior for large $\lambda$.

In Figure 3, we select the distribution used for the rank-consistent case of Figure 2, and
compute the paths from 200 replications for $n = 10^2$, $10^3$, $10^4$ and $10^5$. For each $\lambda$, we plot
the proportion of estimates with correct rank on the left plots (i.e., we get an estimation
of $P(\mathrm{rank}(\hat{W}) = \mathrm{rank}(\mathbf{W}))$), while we plot the logarithm of the average root mean squared

1. MATLAB code can be downloaded from http://www.di.ens.fr/~fbach/tracenorm/ 

2. Simulations may be reproduced with MATLAB code available from http://www.di.ens.fr/~fbach/tracenorm/
Figure 2: Examples of paths of singular values for $\|\Lambda\|_2 = 0.49 < 1$ (consistent, top) and
$\|\Lambda\|_2 = 4.78 > 1$ (inconsistent, bottom) rank selection: regular trace norm penalization
(left) and adaptive penalization with $\gamma = 1/2$ (center) and $\gamma = 1$ (right).
Estimated singular values are plotted in plain, while population singular values
are dotted.
estimation error $\|\hat{W} - \mathbf{W}\|$ on the right plot. For the three regularization schemes, the
range of values with high probability of correct rank selection increases as $n$ increases, and,
most importantly, achieves good mean squared error (right plot); in particular, for the non
adaptive schemes (top plots), this corroborates the results from Proposition 9, which states
that for $\lambda_n = \lambda_0 n^{-1/2}$ the probability tends to a limit in $(0, 1)$: indeed, when $n$ increases,
the value $\lambda_n$ which achieves a particular limit grows as $n^{-1/2}$, and considering the log-scale
for $\lambda_n$ in Figure 3 and the uniform sampling for $n$ in log-scale as well, the regular spacing
between the decaying parts observed in Figure 3 is coherent with our results.

In Figure 4, we perform the same operations but with the inconsistent case of Figure 2.
For the non adaptive case (top plot), the range of values of $\lambda$ that achieve high probability
of correct rank selection does not increase when $n$ increases and stays bounded, in places
where the estimation error is not tending to zero: in the inconsistent case, the trace norm
regularization does not manage to solve the trade-off between rank consistency and regular
consistency. However, for the adaptive versions, it does, still with a somewhat singular
behavior of the limiting case $\gamma = 1$.

Finally, in Figure 5, we consider 400 different distributions with various values of $\|\Lambda\|_2$
smaller or greater than one, and compute the regularization paths with $n = 10^3$ samples.
From the paths, we consider the estimate $\hat{W}$ with correct rank and best distance to $\mathbf{W}$ and
plot the best error versus $\log_{10}(\|\Lambda\|_2)$. For positive values of $\log_{10}(\|\Lambda\|_2)$, the best error is
far from zero, and the error grows with the distance to zero; while for negative values, we
get low errors, with lower errors for small $\log_{10}(\|\Lambda\|_2)$, corroborating the influence of $\|\Lambda\|_2$
described in Proposition 9.

7. Conclusion 

We have presented an analysis of the rank consistency for the penalization by the trace norm, 
and derived general necessary and sufficient conditions. This work can be extended in several 
interesting ways: first, by going from the square loss to more general losses, in particular 
for other types of supervised learning problems such as classification; or by looking at the 
collaborative filtering setting where only some of the attributes are observed (Abernethy 
et al., 2006) and dimensions p and q are allowed to grow. Moreover, we are currently 
pursuing non asymptotic extensions of the current work, making links with the recent work 
of Recht et al. (2007) and of Meinshausen and Yu (2006). 



Appendix A. Tools for analysis of singular value decomposition 

In this appendix, we review and derive precise results regarding singular value decompositions.
We consider $W \in \mathbb{R}^{p \times q}$ and we let denote $W = U \mathrm{Diag}(s) V^\top$ its singular value
decomposition with $U \in \mathbb{R}^{p \times r}$, $V \in \mathbb{R}^{q \times r}$ with orthonormal columns, and $s \in \mathbb{R}^r$ with
strictly positive values ($r$ is the rank of $W$). Note that when a singular value $s_i$ is simple,
i.e., does not coalesce with any other singular values, then the vectors $u_i$ and $v_i$ are uniquely
defined up to simultaneous sign flips, i.e., only the matrix $u_i v_i^\top$ is unique. However, when
some singular values coalesce, then the corresponding singular vectors are defined up to a
Figure 3: Synthetic example where consistency condition in Eq. (8) is satisfied: probability
of correct rank selection (left) and logarithm of the expected mean squared
estimation error (right), for several numbers of samples as a function of the regularization
parameter, for regular regularization (top), adaptive regularization
with $\gamma = 1/2$ (center) and $\gamma = 1$ (bottom).

Figure 4: Synthetic example where consistency condition in Eq. (7) is not satisfied: probability
of correct rank selection (left) and logarithm of the expected mean squared
estimation error (right), for several numbers of samples as a function of the regularization
parameter, for regular regularization (top), adaptive regularization
with $\gamma = 1/2$ (center) and $\gamma = 1$ (bottom).

Figure 5: Scatter plots of $\log_{10}(\|\Lambda\|_2)$ versus the squared error of the best estimate with
correct rank (i.e., such that $\mathrm{rank}(\hat{W}) = r$ and $\|\hat{W} - \mathbf{W}\|$ as small as possible).
See text for details.



rotation, and thus in general care must be taken and considering isolated singular vectors 
should be avoided (Stewart and Sun, 1990). All tools presented in this appendix are robust 
to the particular choice of the singular vectors. 

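A.1 Jordan-Wielandt matrix

We use the fact that singular values of $W$ can be obtained from the eigenvalues of the Jordan-Wielandt
matrix $\bar{W} = \begin{pmatrix} 0 & W \\ W^\top & 0 \end{pmatrix} \in \mathbb{R}^{(p+q) \times (p+q)}$ (Stewart and Sun, 1990). Indeed this
matrix has eigenvalues $s_i$ and $-s_i$, $i = 1, \dots, r$, where $s_i$ are the (strictly positive) singular
values of $W$, with eigenvectors $\frac{1}{\sqrt{2}} \begin{pmatrix} u_i \\ v_i \end{pmatrix}$ and $\frac{1}{\sqrt{2}} \begin{pmatrix} u_i \\ -v_i \end{pmatrix}$ where $u_i, v_i$ are the left and
right associated singular vectors. Also, the remaining eigenvalues are all equal to zero, with
eigensubspace (of dimension $p + q - 2r$) composed of all $\begin{pmatrix} u \\ v \end{pmatrix}$ such that for all $i \in \{1, \dots, r\}$,
$u^\top u_i = v^\top v_i = 0$. We let denote $\bar{U}$ the eigenvectors of $\bar{W}$ corresponding to non zero
eigenvalues in $\bar{S}$. We have $\bar{U} = \frac{1}{\sqrt{2}} \begin{pmatrix} U & U \\ V & -V \end{pmatrix}$ and $\bar{S} = \begin{pmatrix} \mathrm{Diag}(s) & 0 \\ 0 & -\mathrm{Diag}(s) \end{pmatrix}$, and
$\bar{W} = \bar{U} \bar{S} \bar{U}^\top$, $\bar{U} \bar{U}^\top = \begin{pmatrix} UU^\top & 0 \\ 0 & VV^\top \end{pmatrix}$, and $\bar{U}\, \mathrm{sign}(\bar{S})\, \bar{U}^\top = \begin{pmatrix} 0 & UV^\top \\ VU^\top & 0 \end{pmatrix}$.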
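The eigenvalue structure of the Jordan-Wielandt matrix is easy to check numerically. The sketch below builds the matrix with NumPy; the helper name `jordan_wielandt` is an assumption made for this example.

```python
import numpy as np

def jordan_wielandt(W):
    """Symmetric (p+q) x (p+q) matrix [[0, W], [W^T, 0]]."""
    p, q = W.shape
    return np.block([[np.zeros((p, p)), W],
                     [W.T, np.zeros((q, q))]])

# Usage: the eigenvalues are +s_i and -s_i, completed by |p - q| zeros.
W = np.random.default_rng(0).standard_normal((4, 3))
singular_values = np.linalg.svd(W, compute_uv=False)
eigenvalues = np.linalg.eigvalsh(jordan_wielandt(W))
```

Working with this symmetric matrix lets perturbation results for eigenvalues and eigenprojections be transferred to singular values and singular subspaces.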

A.2 Cauchy residue formula and eigenvalues

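Given the matrix $\bar{W}$, and a simple closed curve $\mathcal{C}$ in the complex plane that does not go
through any of the eigenvalues of $\bar{W}$, then

$$\Pi_{\mathcal{C}}(\bar{W}) = \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{d\lambda}{\lambda I - \bar{W}}$$

is equal to the orthogonal projection onto the orthogonal sum of all eigensubspaces of $\bar{W}$
associated with eigenvalues in the interior of $\mathcal{C}$ (Kato, 1966). This is easily seen by writing
down the eigenvalue decomposition $\bar{W} = \sum_i \bar{s}_i \bar{u}_i \bar{u}_i^\top$ and the Cauchy residue formula ($\frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{d\lambda}{\lambda - \lambda_i} = 1$ if $\lambda_i$
is in the interior $\mathrm{int}(\mathcal{C})$ of $\mathcal{C}$ and $0$ otherwise), and:

$$\frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{d\lambda}{\lambda I - \bar{W}} = \sum_{i} \bar{u}_i \bar{u}_i^\top \times \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{d\lambda}{\lambda - \bar{s}_i} = \sum_{i,\ \bar{s}_i \in \mathrm{int}(\mathcal{C})} \bar{u}_i \bar{u}_i^\top.$$

See Rudin (1987) for an introduction to complex analysis and Cauchy residue formula.
Moreover, we can obtain the restriction of $\bar{W}$ onto a specific eigensubspace as:

$$\bar{W}\, \Pi_{\mathcal{C}}(\bar{W}) = \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{\bar{W}\, d\lambda}{\lambda I - \bar{W}} = \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{\lambda\, d\lambda}{\lambda I - \bar{W}}.$$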
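The contour-integral representation of the projector can be checked numerically by discretizing a circle centered at the origin with the trapezoidal rule. This is only an illustrative sketch: the function name `contour_projector` and the number of quadrature nodes are assumptions, and the input is assumed to be a real symmetric matrix with no eigenvalue on the circle.

```python
import numpy as np

def contour_projector(A, radius, num_nodes=200):
    """Approximate (1 / (2 i pi)) * integral over |lambda| = radius of
    (lambda I - A)^{-1} d lambda, for a real symmetric matrix A."""
    n = A.shape[0]
    identity = np.eye(n)
    P = np.zeros((n, n), dtype=complex)
    for k in range(num_nodes):
        lam = radius * np.exp(2j * np.pi * k / num_nodes)
        # d lambda = i * lam * d theta, so the 1/(2 i pi) factor leaves lam / num_nodes
        P += lam * np.linalg.inv(lam * identity - A) / num_nodes
    # the imaginary parts cancel between conjugate nodes for real symmetric A
    return P.real
```

Applied to a Jordan-Wielandt matrix with a radius equal to half the smallest positive singular value, the result should coincide with the orthogonal projector onto the null space, whose trace is $p + q - 2r$.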

We let denote $s_1$ and $s_r$ the largest and smallest strictly positive singular values of $W$; if
$\|\Delta\|_2 < s_r/2$, then $W + \Delta$ has $r$ singular values strictly greater than $s_r/2$ and the remaining
ones are strictly less than $s_r/2$ (Stewart and Sun, 1990). Thus, if we denote $\mathcal{C}$ the oriented
circle of radius $s_r/2$, $\Pi_{\mathcal{C}}(\bar{W})$ is the projector on the $(p + q - 2r)$-dimensional null space of $\bar{W}$,
and for any $\Delta$ such that $\|\Delta\|_2 < s_r/2$, $\Pi_{\mathcal{C}}(\bar{W} + \bar{\Delta})$ is also the projector on the $(p + q - 2r)$-dimensional
invariant subspace of $\bar{W} + \bar{\Delta}$, which corresponds to the smallest eigenvalues.
We let denote $\Pi_o(\bar{W} + \bar{\Delta})$ that projector and $\Pi_r(\bar{W} + \bar{\Delta}) = I - \Pi_o(\bar{W} + \bar{\Delta})$ the orthogonal
projector (which is the projection onto the $2r$-th principal subspace).
We can now find expansions around $\Delta = 0$ as follows:

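$$\Pi_o(\bar{W} + \bar{\Delta}) - \Pi_o(\bar{W}) = \frac{1}{2i\pi} \oint_{\mathcal{C}} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W} - \bar{\Delta})^{-1} d\lambda$$

$$= \frac{1}{2i\pi} \oint_{\mathcal{C}} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} d\lambda + \frac{1}{2i\pi} \oint_{\mathcal{C}} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W} - \bar{\Delta})^{-1} d\lambda,$$

and

$$(\bar{W} + \bar{\Delta}) \Pi_o(\bar{W} + \bar{\Delta}) - \bar{W} \Pi_o(\bar{W}) = \frac{1}{2i\pi} \oint_{\mathcal{C}} \lambda (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W} - \bar{\Delta})^{-1} d\lambda$$

$$= \frac{1}{2i\pi} \oint_{\mathcal{C}} \lambda (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} d\lambda + \frac{1}{2i\pi} \oint_{\mathcal{C}} \lambda (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W} - \bar{\Delta})^{-1} d\lambda,$$

where both identities follow from $(\lambda I - \bar{W} - \bar{\Delta})^{-1} - (\lambda I - \bar{W})^{-1} = (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W} - \bar{\Delta})^{-1}$, and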

which lead to the following two propositions: 

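Proposition 16 Assume $W$ has rank $r$ and $\|\Delta\|_2 < s_r/4$ where $s_r$ is the smallest positive
singular value of $W$. Then the projection $\Pi_r(\bar{W})$ on the first $r$ eigenvectors of $\bar{W}$ is such
that

$$\|\Pi_r(\bar{W} + \bar{\Delta}) - \Pi_r(\bar{W})\|_2 \leqslant \frac{4}{s_r} \|\Delta\|_2$$

and

$$\|\Pi_r(\bar{W} + \bar{\Delta}) - \Pi_r(\bar{W}) - (I - \bar{U}\bar{U}^\top) \bar{\Delta} \bar{U} \bar{S}^{-1} \bar{U}^\top - \bar{U} \bar{S}^{-1} \bar{U}^\top \bar{\Delta} (I - \bar{U}\bar{U}^\top)\|_2 \leqslant \frac{8}{s_r^2} \|\Delta\|_2^2.$$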
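Proof  For $\lambda \in \mathcal{C}$ we have: $\|(\lambda I - \bar{W})^{-1}\|_2 \leqslant 2/s_r$ and $\|(\lambda I - \bar{W} - \bar{\Delta})^{-1}\|_2 \leqslant 4/s_r$, which
implies

$$\|\Pi_r(\bar{W} + \bar{\Delta}) - \Pi_r(\bar{W})\|_2 \leqslant \frac{1}{2\pi} \oint_{\mathcal{C}} \|(\lambda I - \bar{W})^{-1}\|_2 \, \|\Delta\|_2 \, \|(\lambda I - \bar{W} - \bar{\Delta})^{-1}\|_2 \, |d\lambda| \leqslant \frac{1}{2\pi} \cdot \frac{2\pi s_r}{2} \cdot \frac{2}{s_r} \cdot \frac{4}{s_r} \|\Delta\|_2.$$

In order to prove the other result, we simply need to compute (recall that $\Pi_r = I - \Pi_o$ and that the only eigenvalue of $\bar{W}$ in $\mathrm{int}(\mathcal{C})$ is zero):

$$-\frac{1}{2i\pi} \oint_{\mathcal{C}} (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} d\lambda = -\sum_{i,j} \bar{u}_i \bar{u}_i^\top \bar{\Delta} \bar{u}_j \bar{u}_j^\top \, \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{d\lambda}{(\lambda - \bar{s}_i)(\lambda - \bar{s}_j)}$$

$$= \sum_{i,j} \bar{u}_i \bar{u}_i^\top \bar{\Delta} \bar{u}_j \bar{u}_j^\top \left( \frac{1_{i \notin \mathrm{int}(\mathcal{C})} 1_{j \in \mathrm{int}(\mathcal{C})}}{\bar{s}_i} + \frac{1_{i \in \mathrm{int}(\mathcal{C})} 1_{j \notin \mathrm{int}(\mathcal{C})}}{\bar{s}_j} \right)$$

$$= (I - \bar{U}\bar{U}^\top) \bar{\Delta} \bar{U} \bar{S}^{-1} \bar{U}^\top + \bar{U} \bar{S}^{-1} \bar{U}^\top \bar{\Delta} (I - \bar{U}\bar{U}^\top).$$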



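Proposition 17 Assume $W$ has rank $r$ and $\|\Delta\|_2 < s_r/4$ where $s_r$ is the smallest positive
singular value of $W$. Then the projection $\Pi_o(\bar{W})$ onto the null space of $\bar{W}$ is such
that

$$\|\Pi_o(\bar{W} + \bar{\Delta})(\bar{W} + \bar{\Delta}) - \Pi_o(\bar{W}) \bar{W}\|_2 \leqslant 2 \|\Delta\|_2$$

and

$$\|\Pi_o(\bar{W} + \bar{\Delta})(\bar{W} + \bar{\Delta}) - \Pi_o(\bar{W}) \bar{W} - (I - \bar{U}\bar{U}^\top) \bar{\Delta} (I - \bar{U}\bar{U}^\top)\|_2 \leqslant \frac{4}{s_r} \|\Delta\|_2^2.$$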



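Proof  For $\lambda \in \mathcal{C}$ we have: $\|(\lambda I - \bar{W})^{-1}\|_2 \leqslant 2/s_r$ and $\|(\lambda I - \bar{W} - \bar{\Delta})^{-1}\|_2 \leqslant 4/s_r$, which
implies

$$\|\Pi_o(\bar{W} + \bar{\Delta})(\bar{W} + \bar{\Delta}) - \Pi_o(\bar{W}) \bar{W}\|_2 \leqslant \frac{1}{2\pi} \oint_{\mathcal{C}} |\lambda| \, \|(\lambda I - \bar{W})^{-1}\|_2 \, \|\Delta\|_2 \, \|(\lambda I - \bar{W} - \bar{\Delta})^{-1}\|_2 \, |d\lambda|$$

$$\leqslant \frac{1}{2\pi} \cdot \frac{2\pi s_r}{2} \cdot \frac{s_r}{2} \cdot \|\Delta\|_2 \cdot \frac{2}{s_r} \cdot \frac{4}{s_r} = 2 \|\Delta\|_2.$$

In order to prove the other result, we simply need to compute:

$$\frac{1}{2i\pi} \oint_{\mathcal{C}} \lambda (\lambda I - \bar{W})^{-1} \bar{\Delta} (\lambda I - \bar{W})^{-1} d\lambda = \sum_{i,j} \bar{u}_i \bar{u}_i^\top \bar{\Delta} \bar{u}_j \bar{u}_j^\top \, \frac{1}{2i\pi} \oint_{\mathcal{C}} \frac{\lambda \, d\lambda}{(\lambda - \bar{s}_i)(\lambda - \bar{s}_j)}$$

$$= \sum_{i,j} \bar{u}_i \bar{u}_i^\top \bar{\Delta} \bar{u}_j \bar{u}_j^\top \, 1_{i \in \mathrm{int}(\mathcal{C})} 1_{j \in \mathrm{int}(\mathcal{C})} = (I - \bar{U}\bar{U}^\top) \bar{\Delta} (I - \bar{U}\bar{U}^\top).$$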
The variations of $\Pi(\bar{W})$ translate immediately into variations of the singular projections
$UU^\top$ and $VV^\top$. Indeed we get that the first order variation of $UU^\top$ is
$(I - UU^\top) \Delta V S^{-1} U^\top$ (plus its transpose) and the variation of $VV^\top$ is equal to $(I - VV^\top) \Delta^\top U S^{-1} V^\top$ (plus its transpose), with errors
bounded in spectral norm by $\frac{8}{s_r^2} \|\Delta\|_2^2$. Similarly, when restricted to the small singular values,
the first order expansion is $(I - UU^\top) \Delta (I - VV^\top)$, with error term bounded in spectral
norm by $\frac{4}{s_r} \|\Delta\|_2^2$. Those results lead to the following proposition that gives a local sufficient
condition for $\mathrm{rank}(W + \Delta) > \mathrm{rank}(W)$:

Proposition 18 Assume $W$ has rank $r < \min\{p, q\}$ with ordered singular value decomposition
$W = U \mathrm{Diag}(s) V^\top$. If $\frac{4}{s_r} \|\Delta\|_2^2 < \|(I - UU^\top) \Delta (I - VV^\top)\|_2$, then $\mathrm{rank}(W + \Delta) > r$.

Appendix B. Some facts about the trace norm 

In this appendix, we review known properties of the trace norm that we use in this paper.
Most of the results are extensions of similar results for the $\ell_1$-norm on vectors. First, we
have the following result:

Lemma 19 (Dual norm, Fazel et al., 2001) The trace norm $\|\cdot\|_*$ is a norm and its
dual norm is the operator norm $\|\cdot\|_2$.

Note that the dual norm $N(W)$ is defined as (Boyd and Vandenberghe, 2003):

$$N(W) = \sup_{\|V\|_* \leqslant 1} \mathrm{tr}\, W^\top V.$$

This immediately implies the following result:

Lemma 20 (Fenchel conjugate) We have: $\max_{W} \ \mathrm{tr}\, W^\top V - \|W\|_* = 0$ if $\|V\|_2 \leqslant 1$ and
$+\infty$ otherwise.

In this paper, we need to compute the subdifferential and directional derivatives of the 
trace norm. We have from Recht et al. (2007) or Borwein and Lewis (2000): 

Proposition 21 (Subdifferential) If $W = U \mathrm{Diag}(s) V^\top$ with $U \in \mathbb{R}^{p \times m}$ and $V \in \mathbb{R}^{q \times m}$
having orthonormal columns, and $s \in \mathbb{R}^m$ is strictly positive, is the singular value decomposition
of $W$, then $\|W\|_* = \sum_{i=1}^m s_i$ and the subdifferential of $\|\cdot\|_*$ is equal to

$$\partial \|\cdot\|_*(W) = \left\{ UV^\top + M, \text{ such that } \|M\|_2 \leqslant 1,\ U^\top M = 0 \text{ and } MV = 0 \right\}.$$
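To make the subdifferential concrete, the sketch below builds one of its elements from an arbitrary matrix by projecting it onto the orthogonal complements of the singular subspaces and rescaling. The function name `trace_norm_subgradient`, the rank tolerance `tol`, and the choice of projection are assumptions made for this example.

```python
import numpy as np

def trace_norm_subgradient(W, M_raw, tol=1e-10):
    """Return U V^T + M, an element of the subdifferential of the trace norm at W.

    M is obtained from the arbitrary matrix M_raw so that U^T M = 0, M V = 0
    and the spectral norm of M is at most one.
    """
    U_full, s, Vt_full = np.linalg.svd(W, full_matrices=True)
    m = int(np.sum(s > tol))                       # numerical rank
    U, V = U_full[:, :m], Vt_full[:m].T
    U_perp, V_perp = U_full[:, m:], Vt_full[m:].T  # orthonormal complements
    M = U_perp @ (U_perp.T @ M_raw @ V_perp) @ V_perp.T
    norm_M = np.linalg.norm(M, 2) if M.size else 0.0
    if norm_M > 1.0:
        M = M / norm_M
    return U @ V.T + M
```

Any matrix $G$ produced this way should satisfy the subgradient inequality $\|W'\|_* \geqslant \|W\|_* + \mathrm{tr}\, G^\top (W' - W)$ for every $W'$, with $\mathrm{tr}\, G^\top W = \|W\|_*$.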

This result can be extended to compute directional derivatives: 

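Proposition 22 (Directional derivative) The directional derivative at $W = USV^\top$ is
equal to:

$$\lim_{\varepsilon \to 0^+} \frac{\|W + \varepsilon \Delta\|_* - \|W\|_*}{\varepsilon} = \mathrm{tr}\, U^\top \Delta V + \|U_\perp^\top \Delta V_\perp\|_*,$$

where $U_\perp \in \mathbb{R}^{p \times (p-m)}$ and $V_\perp \in \mathbb{R}^{q \times (q-m)}$ are any orthonormal complements of $U$ and $V$.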
Proof  From the subdifferential, we get the directional derivative (Borwein and Lewis,
2000) as

$$\lim_{\varepsilon \to 0^+} \frac{\|W + \varepsilon \Delta\|_* - \|W\|_*}{\varepsilon} = \max_{V \in \partial \|\cdot\|_*(W)} \mathrm{tr}\, \Delta^\top V,$$

which exactly leads to the desired result. ■
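The directional derivative formula can be compared with a one-sided finite difference of the trace norm. The sketch below is illustrative only; the function name `trace_norm_directional_derivative` and the rank tolerance `tol` are assumptions.

```python
import numpy as np

def trace_norm_directional_derivative(W, D, tol=1e-10):
    """tr(U^T D V) + || U_perp^T D V_perp ||_* for the direction D at W."""
    U_full, s, Vt_full = np.linalg.svd(W, full_matrices=True)
    m = int(np.sum(s > tol))                       # numerical rank of W
    U, V = U_full[:, :m], Vt_full[:m].T
    U_perp, V_perp = U_full[:, m:], Vt_full[m:].T
    core = U_perp.T @ D @ V_perp                   # part of D outside both subspaces
    residual = np.linalg.norm(core, "nuc") if core.size else 0.0
    return float(np.trace(U.T @ D @ V) + residual)

# Usage: compare with (||W + h D||_* - ||W||_*) / h for a small h > 0.
```

The value does not depend on the particular singular vectors returned by the decomposition, since simultaneous sign flips or rotations within a singular subspace leave both terms unchanged.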

The final result that we use is a bit finer as it gives an upper bound on the error in the 
previous limit: 

Proposition 23 Let $W = U \mathrm{Diag}(s) V^\top$ be the ordered singular value decomposition, where
$\mathrm{rank}(W) = r$, $s > 0$, and let $U_\perp$ and $V_\perp$ be orthogonal complements of $U$ and $V$; then, if
$\|\Delta\|_2 < s_r/4$:

$$\Big| \|W + \Delta\|_* - \|W\|_* - \mathrm{tr}\, U^\top \Delta V - \|U_\perp^\top \Delta V_\perp\|_* \Big| \leqslant 16 \min\{p, q\} \frac{1}{s_r} \|\Delta\|_2^2.$$

Proof  The trace norm $\|W + \Delta\|_*$ may be divided into the sum of the $r$ largest and the
sum of the remaining singular values. The sum of the remaining ones is given through
Proposition 17 by $\|U_\perp^\top \Delta V_\perp\|_*$ with an error bounded by $\min\{p, q\} \frac{4}{s_r} \|\Delta\|_2^2$. For the first $r$
singular values, we need to upper bound the second derivative of the sum of the $r$ largest
eigenvalues of $\bar{W} + \bar{\Delta}$ with strictly positive eigengap, which leads to the given bound by
using the same Cauchy residue technique described in Appendix A. ■



Appendix C. Proofs 
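C.1 Proof of Lemma 2

We let denote $S \in \{0, 1\}^{n_x \times n_y}$ the sampling matrix; i.e., $S_{ij} = 1$ if the pair $(i, j)$ is observed
and zero otherwise. We let denote $X$ and $Y$ the data matrices. We can write $M_k =
X^\top \delta_{i_k} \delta_{j_k}^\top Y$ and:

$$\frac{1}{n} \sum_{k=1}^n \mathrm{vec}(M_k) \mathrm{vec}(M_k)^\top = \frac{1}{n} \sum_{k=1}^n (Y \otimes X)^\top \mathrm{vec}(\delta_{i_k} \delta_{j_k}^\top) \mathrm{vec}(\delta_{i_k} \delta_{j_k}^\top)^\top (Y \otimes X) = \frac{1}{n} (Y \otimes X)^\top \mathrm{Diag}(\mathrm{vec}(S)) (Y \otimes X),$$

which leads to (denoting $\hat{\Sigma}_{xx} = n_x^{-1} X^\top X$ and $\hat{\Sigma}_{yy} = n_y^{-1} Y^\top Y$):

$$\frac{1}{n} \sum_{k=1}^n \mathrm{vec}(M_k) \mathrm{vec}(M_k)^\top - \hat{\Sigma}_{yy} \otimes \hat{\Sigma}_{xx} = \frac{1}{n} (Y \otimes X)^\top \mathrm{Diag}(\mathrm{vec}(S - n/n_x n_y)) (Y \otimes X).$$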
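We can thus compute the squared Frobenius norm:

$$\Big\| \frac{1}{n} \sum_{k=1}^n \mathrm{vec}(M_k) \mathrm{vec}(M_k)^\top - \hat{\Sigma}_{yy} \otimes \hat{\Sigma}_{xx} \Big\|_F^2 = \frac{1}{n^2} \mathrm{tr}\, \mathrm{Diag}(\mathrm{vec}(S - n/n_x n_y)) (YY^\top \otimes XX^\top) \mathrm{Diag}(\mathrm{vec}(S - n/n_x n_y)) (YY^\top \otimes XX^\top)$$

$$= \frac{1}{n^2} \sum_{i,j,i',j'} (S_{ij} - n/n_x n_y) (YY^\top \otimes XX^\top)_{ij,i'j'} (S_{i'j'} - n/n_x n_y) (YY^\top \otimes XX^\top)_{ij,i'j'}.$$

We have, by properties of sampling without replacement (Hoeffding, 1963):

$$E (S_{ij} - n/n_x n_y)(S_{i'j'} - n/n_x n_y) = \frac{n}{n_x n_y} \Big(1 - \frac{n}{n_x n_y}\Big) \ \text{ if } (i, j) = (i', j'),$$

$$E (S_{ij} - n/n_x n_y)(S_{i'j'} - n/n_x n_y) = -\frac{n}{n_x n_y} \Big(1 - \frac{n}{n_x n_y}\Big) \frac{1}{n_x n_y - 1} \ \text{ if } (i, j) \neq (i', j').$$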

This implies 
1 " 

E(||- vec(M fe ) vec(M fc ) T - £ ra g> E^HHX, F) 
=i 



if 



k=l 



i.J 



{n x n y - l)n x n y n 



4 II ~ i|4 



This finally implies that 



E 



1 ™ 

- Y v ec(M fc ) vec(M fc ) T - E yy ® 



fc=i 



sS — E||x|| 4 E||y|| 4 + 2E||£, r , r — S- ra .||^E||S ra ||p' + 2E||Sj /3/ — Ej /J ,|||'||E :ca .||f' 



'■j 



1 1 1 



CE||x|| 4 E||y|| 4 x (- + — + — ), 

for some constant $C > 0$. This implies (A2). To prove the asymptotic normality in
(A3), we use the martingale central limit theorem (Hall and Heyde, 1980) with sequence
of $\sigma$-fields $\mathcal{F}_{n,k} = \sigma(X, Y, \varepsilon_1, \dots, \varepsilon_k, (i_1, j_1), \dots, (i_k, j_k))$ for $k \leqslant n$. We consider $\Delta_{n,k} =
n^{-1/2} \varepsilon_{i_k j_k} \, y_{j_k} \otimes x_{i_k} \in \mathbb{R}^{pq}$ as the martingale difference. We have $E(\Delta_{n,k} | \mathcal{F}_{n,k-1}) = 0$ and

$$E(\Delta_{n,k} \Delta_{n,k}^\top | \mathcal{F}_{n,k-1}) = n^{-1} \sigma^2 \, y_{j_k} y_{j_k}^\top \otimes x_{i_k} x_{i_k}^\top,$$

with $E(\|\Delta_{n,k}\|^4) = O(n^{-2})$ because of the finite fourth order moments. Moreover,

$$\sum_{k=1}^n E(\Delta_{n,k} \Delta_{n,k}^\top | \mathcal{F}_{n,k-1}) = \sigma^2 \hat{\Sigma}_{mm},$$

and thus tends in probability to $\sigma^2 \Sigma_{yy} \otimes \Sigma_{xx}$ because of (A2). Since the assumptions of the
martingale central limit theorem are met, $\sum_{k=1}^n \mathrm{vec}(\Delta_{n,k})$ is asymptotically
normal with mean zero and covariance matrix $\sigma^2 \Sigma_{yy} \otimes \Sigma_{xx}$, which concludes the proof.
C.2 Proof of Proposition 4

We may first restrict minimization over the ball $\{W, \|W\|_* \leqslant \|\hat{\Sigma}_{mm}^{-1} \hat{\Sigma}_{Mz}\|_*\}$ because the
optimum value is less than the value for $W = \hat{\Sigma}_{mm}^{-1} \hat{\Sigma}_{Mz}$. Since this random variable
is bounded in probability, we can reduce the problem to a compact set. The sequence of
continuous random functions $W \mapsto \frac{1}{2} \mathrm{vec}(W - \mathbf{W})^\top \hat{\Sigma}_{mm} \mathrm{vec}(W - \mathbf{W}) - \mathrm{tr}\, W^\top \hat{\Sigma}_{M\varepsilon} + \lambda_n \|W\|_*$
converges pointwise in probability to $W \mapsto \frac{1}{2} \mathrm{vec}(W - \mathbf{W})^\top \Sigma_{mm} \mathrm{vec}(W - \mathbf{W}) + \lambda_0 \|W\|_*$
with a unique global minimum (because $\Sigma_{mm}$ is assumed invertible). We can thus apply
standard results of consistency in M-estimation (Van der Vaart, 1998, Shao, 2003).



C.3 Proof of Proposition 7 



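We consider the result of Proposition 6: $\hat{\Delta} = n^{1/2}(\hat{W} - \mathbf{W})$ is asymptotically normal with
mean zero and covariance $\sigma^2 \Sigma_{mm}^{-1}$. By Proposition 18 in Appendix A, if $\frac{4 n^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 <
\|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2$, then $\mathrm{rank}(\hat{W}) > r$. For a random variable $\Theta$ with normal distribution
with mean zero and covariance matrix $\sigma^2 \Sigma_{mm}^{-1}$, we let denote $f(C) = P\big(\frac{4 C^{-1/2}}{s_r} \|\Theta\|_2^2 <
\|\mathbf{U}_\perp^\top \Theta \mathbf{V}_\perp\|_2\big)$. By the dominated convergence theorem, $f(C)$ converges to one when $C \to \infty$.
Let $\varepsilon > 0$, thus there exists $C_0 > 0$ such that $f(C_0) > 1 - \varepsilon/2$. By the asymptotic normality
result, $P\big(\frac{4 C_0^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2\big)$ converges to $f(C_0)$, thus $\exists n_0 > 0$ such that $\forall n > n_0$,
$P\big(\frac{4 C_0^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2\big) > f(C_0) - \varepsilon/2 > 1 - \varepsilon$, which concludes the proof, because

$$P\Big(\frac{4 n^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2\Big) \geqslant P\Big(\frac{4 C_0^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2\Big)$$

as soon as $n > C_0$.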



C.4 Proof of Proposition 8 

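This is the same result as Fu and Knight (2000), but extended to the trace norm minimization,
simply using the directional derivative result of Proposition 22 and the epi-convergence
theorem from Geyer (1994, 1996). Indeed, if we denote $V_n(\Delta) = \mathrm{vec}(\Delta)^\top \hat{\Sigma}_{mm} \mathrm{vec}(\Delta) -
\mathrm{tr}\, \Delta^\top n^{1/2} \hat{\Sigma}_{M\varepsilon} + \lambda_0 n^{1/2} (\|\mathbf{W} + n^{-1/2} \Delta\|_* - \|\mathbf{W}\|_*)$ and $V(\Delta) = \mathrm{vec}(\Delta)^\top \Sigma_{mm} \mathrm{vec}(\Delta) -
\mathrm{tr}\, \Delta^\top A + \lambda_0 [\mathrm{tr}\, \mathbf{U}^\top \Delta \mathbf{V} + \|\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp\|_*]$, then for each $\Delta$, $V_n(\Delta)$ converges in probability
to $V(\Delta)$, and $V$ is strictly convex, which implies that it has a unique global minimum;
thus the epi-convergence theorem can be applied, which concludes the proof.

Note that a simpler analysis using regular tools in M-estimation leads to $\hat{W} = \mathbf{W} +
n^{-1/2} \hat{\Delta} + o_p(n^{-1/2})$, where $\hat{\Delta}$ is the unique global minimizer of

$$\min_{\Delta \in \mathbb{R}^{p \times q}} \ \frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm} \mathrm{vec}(\Delta) - \mathrm{tr}\, \Delta^\top (n^{1/2} \hat{\Sigma}_{M\varepsilon}) + \lambda_0 [\mathrm{tr}\, \mathbf{U}^\top \Delta \mathbf{V} + \|\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp\|_*],$$

i.e., we can actually take $A = n^{1/2} \hat{\Sigma}_{M\varepsilon}$ (which is asymptotically normal with correct moments).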
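C.5 Proof of Proposition 9

We let denote $\hat{\Delta} = n^{1/2}(\hat{W} - \mathbf{W})$. We first show that $\limsup_{n \to \infty} P(\mathrm{rank}(\hat{W}) = r)$ is smaller
than the proposed limit $a$. We consider the following events:

$$E_0 = \{\mathrm{rank}(\hat{W}) = r\},$$

$$E_1 = \{\|n^{-1/2} \hat{\Delta}\|_2 < s_r/2\},$$

$$E_2 = \Big\{ \frac{4 n^{-1/2}}{s_r} \|\hat{\Delta}\|_2^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2 \Big\}.$$

By Proposition 18 in Appendix A, we have $E_1 \cap E_2 \subset E_0^c$, and thus it suffices to show that
$P(E_1)$ tends to one, while $\limsup_{n \to \infty} P(E_2^c) \leqslant a$. The first assertion is a simple consequence
of Proposition 8.

Moreover, by Proposition 8, $\hat{\Delta}$ converges in distribution to the unique global optimum
$\Delta(A)$ of an optimization problem parameterized by a vector $A$ with normal distribution.
For a given $\eta > 0$, we consider the probability $P(\|\mathbf{U}_\perp^\top \Delta(A) \mathbf{V}_\perp\|_2 \leqslant \eta)$. For any $A$, when $\eta$
tends to zero, the indicator function $1_{\|\mathbf{U}_\perp^\top \Delta(A) \mathbf{V}_\perp\|_2 \leqslant \eta}$ converges to $1_{\|\mathbf{U}_\perp^\top \Delta(A) \mathbf{V}_\perp\|_2 = 0}$, which
is equal to $1_{\|\Lambda(A)\|_2 \leqslant \lambda_0}$, where

$$\mathrm{vec}(\Lambda(A)) = \big( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \big)^{-1} \big( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} ((\mathbf{V} \otimes \mathbf{U}) \mathrm{vec}(I) - \mathrm{vec}(A)) \big).$$

By the dominated convergence theorem, $P(\|\mathbf{U}_\perp^\top \Delta(A) \mathbf{V}_\perp\|_2 \leqslant \eta)$ converges to $a = P(\|\Lambda(A)\|_2 \leqslant
\lambda_0)$, which is the proposed limit. This limit is in $(0, 1)$ because the normal distribution
has an invertible covariance matrix and the set $\{\|\Lambda\|_2 \leqslant 1\}$ and its complement have non
empty interiors.

Since $\hat{\Delta} = O_p(1)$, we can instead consider $E_3 = \big\{ \frac{4 n^{-1/2}}{s_r} M^2 < \|\mathbf{U}_\perp^\top \hat{\Delta} \mathbf{V}_\perp\|_2 \big\}$ for a particular
$M$, instead of $E_2$. Then, following the same line of arguments as in Appendix C.3,
we conclude that $\limsup_{n \to \infty} P(E_3^c) \leqslant a$, which concludes the first part of the proof.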



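We now show that $\liminf_{n \to \infty} P(\mathrm{rank}(\hat{W}) = r) \geqslant a$. A sufficient condition for rank
consistency is the following: we let denote $\hat{W} = USV^\top$ the singular value decomposition of
$\hat{W}$ and we let denote $U_o$ and $V_o$ the singular vectors corresponding to all but the $r$ largest
singular values. Since we have simultaneous singular value decompositions, a sufficient
condition is that $\mathrm{rank}(\hat{W}) \geqslant r$ and $\big\| U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o \big\|_2 < \lambda_n (1 - \eta)$. If
$\|\Lambda(n^{1/2} \hat{\Sigma}_{M\varepsilon})\|_2 < \lambda_0 (1 - \eta)$, then, by Lemma 11, $\mathbf{U}_\perp^\top \Delta(n^{1/2} \hat{\Sigma}_{M\varepsilon}) \mathbf{V}_\perp = 0$, and we get, using
the proof of Proposition 8 and the notation $A = n^{1/2} \hat{\Sigma}_{M\varepsilon}$:

$$U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o = U_o^\top n^{-1/2} \big( \Sigma_{mm} \Delta(A) - A \big) V_o + o_p(n^{-1/2}).$$

Moreover, because of regular consistency and a positive eigengap for $\mathbf{W}$, the projection
onto the first $r$ singular vectors of $\hat{W}$ converges to the projection onto the first $r$ singular
vectors of $\mathbf{W}$ (see Appendix A), which implies that the projection onto the orthogonal
is also consistent, i.e., $U_o U_o^\top$ converges in probability to $\mathbf{U}_\perp \mathbf{U}_\perp^\top$ and $V_o V_o^\top$ converges in
probability to $\mathbf{V}_\perp \mathbf{V}_\perp^\top$. Thus:

$$\big\| U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o \big\|_2 = \big\| U_o U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o V_o^\top \big\|_2$$

$$= n^{-1/2} \big\| \mathbf{U}_\perp \mathbf{U}_\perp^\top (\Sigma_{mm} \Delta(A) - A) \mathbf{V}_\perp \mathbf{V}_\perp^\top \big\|_2 + o_p(n^{-1/2})$$

$$= n^{-1/2} \|\Lambda(A)\|_2 + o_p(n^{-1/2}).$$

This implies that

$$\liminf_{n \to \infty} P\Big( \big\| U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o \big\|_2 < \lambda_n (1 - \eta) \Big) \geqslant \liminf_{n \to \infty} P\big( \|\Lambda(A)\|_2 < \lambda_0 (1 - \eta) \big),$$

which converges to $a$ when $\eta$ tends to zero, which concludes the proof.
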
C.6 Proof of Proposition 10 

This is the same result as Fu and Knight (2000), but extended to the trace norm minimization,
simply using the directional derivative result of Proposition 22. If we write
$\hat{W} = \mathbf{W} + \lambda_n \hat{\Delta}$, then $\hat{\Delta}$ is defined as the global minimum of

K(A) = ivec(A) T S mm vec(A) - A^trA^M, + A^QIW + A„A||* - ||W||*) 

= ^vec(A) T S mm vec(A) + O p (C„||A|| 2 ) + O p (A; 1 n~ 1 / 2 ) +trA T S Me 

+trU T AV + ||UlAVi||, + O p (A n ||A|||) 
= V(A) + P (Cn||A|| 2 ) + OpiX^n- 1 ' 2 ) + O p (A„||A|| 2 ). 

More precisely, if MX n < s r /2, 

E sup |F n (A)-y(A)| = est x fM^IIE^-E^llir + MA-^dlEM.H^^ + A^M 2 

l|A|| 2 <M V 

= 0(M 2 ( n + MX^n^ 1/2 + X n M 2 ). 

Moreover, $V(\Delta)$ achieves its minimum at a bounded point $\Delta_0 \neq 0$. Thus, by Markov's
inequality, the minimum of $V_n(\Delta)$ over the ball $\|\Delta\|_2 < 2\|\Delta_0\|_2$ is with probability tending
to one strictly inside and is thus also the unconstrained minimum, which leads to the
proposition.

C.7 Proof of Proposition 11 

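The optimal $\Delta \in \mathbb{R}^{p \times q}$ should be such that $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp$ has low rank, where $\mathbf{U}_\perp \in \mathbb{R}^{p \times (p-r)}$
and $\mathbf{V}_\perp \in \mathbb{R}^{q \times (q-r)}$ are orthogonal complements of the singular vectors $\mathbf{U}$ and $\mathbf{V}$. We now
derive the condition under which the optimal $\Delta$ is such that $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp$ is actually equal
to zero: we consider the minimum of $\frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm} \mathrm{vec}(\Delta) + \mathrm{vec}(\Delta)^\top \mathrm{vec}(\mathbf{U}\mathbf{V}^\top)$ with
respect to $\Delta$ such that $\mathrm{vec}(\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp) = (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \mathrm{vec}(\Delta) = 0$. The solution of that
constrained optimization problem is obtained through the following linear system (Boyd
and Vandenberghe, 2003):

$$\begin{pmatrix} \Sigma_{mm} & -(\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \\ (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top & 0 \end{pmatrix} \begin{pmatrix} \mathrm{vec}(\Delta) \\ \mathrm{vec}(\Lambda) \end{pmatrix} = \begin{pmatrix} -\mathrm{vec}(\mathbf{U}\mathbf{V}^\top) \\ 0 \end{pmatrix},$$

where $\Lambda \in \mathbb{R}^{(p-r) \times (q-r)}$ is the Lagrange multiplier for the equality constraint (with the sign convention that makes the formulas below hold). We can solve
explicitly for $\Delta$ and $\Lambda$, which leads to

$$\mathrm{vec}(\Lambda) = \big( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V}_\perp \otimes \mathbf{U}_\perp) \big)^{-1} \big( (\mathbf{V}_\perp \otimes \mathbf{U}_\perp)^\top \Sigma_{mm}^{-1} (\mathbf{V} \otimes \mathbf{U}) \mathrm{vec}(I) \big),$$

and

$$\mathrm{vec}(\Delta) = -\Sigma_{mm}^{-1} \mathrm{vec}(\mathbf{U}\mathbf{V}^\top - \mathbf{U}_\perp \Lambda \mathbf{V}_\perp^\top).$$

Then the minimum of the function $F(\Delta)$ in Eq. (5) is such that $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$ (and thus
equal to $\Delta$ defined above) if and only if for all $\Theta \in \mathbb{R}^{p \times q}$, the directional derivative of $F$ at
$\Delta$ in the direction $\Theta$ is nonnegative, i.e.:

$$\lim_{\varepsilon \to 0^+} \frac{F(\Delta + \varepsilon \Theta) - F(\Delta)}{\varepsilon} \geqslant 0.$$

By Proposition 22, this directional derivative is equal to

$$\mathrm{tr}\, \Theta^\top (\Sigma_{mm} \Delta + \mathbf{U}\mathbf{V}^\top) + \|\mathbf{U}_\perp^\top \Theta \mathbf{V}_\perp\|_* = \mathrm{tr}\, \Theta^\top \mathbf{U}_\perp \Lambda \mathbf{V}_\perp^\top + \|\mathbf{U}_\perp^\top \Theta \mathbf{V}_\perp\|_* = \mathrm{tr}\, \Lambda^\top \mathbf{U}_\perp^\top \Theta \mathbf{V}_\perp + \|\mathbf{U}_\perp^\top \Theta \mathbf{V}_\perp\|_*.$$

Thus the directional derivative is always nonnegative if for all $\Theta' \in \mathbb{R}^{(p-r) \times (q-r)}$, $\mathrm{tr}\, \Lambda^\top \Theta' +
\|\Theta'\|_* \geqslant 0$, i.e., if and only if $\|\Lambda\|_2 \leqslant 1$, which concludes the proof.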
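The condition $\|\Lambda\|_2 \leqslant 1$ can be evaluated numerically once $\Sigma_{mm}$ and the singular vectors are known. The following sketch uses the explicit formulas for $\Lambda$ and $\Delta$; the function name `rank_consistency_multiplier` is hypothetical, $\mathrm{vec}$ is taken to be column stacking, and the orthogonal complements are obtained from a full QR factorization (any orthonormal complement would do).

```python
import numpy as np

def rank_consistency_multiplier(Sigma_mm, U, V):
    """Return (Lambda, Delta) for the constrained quadratic problem.

    Sigma_mm : (p*q, p*q) invertible matrix, vec = column stacking
    U, V     : (p, r) and (q, r) matrices with orthonormal columns
    """
    p, r = U.shape
    q = V.shape[0]
    # orthonormal complements of the column spaces of U and V
    U_perp = np.linalg.qr(U, mode="complete")[0][:, r:]
    V_perp = np.linalg.qr(V, mode="complete")[0][:, r:]
    C = np.kron(V_perp, U_perp)                      # V_perp (kron) U_perp
    g = (U @ V.T).reshape(-1, order="F")             # vec(U V^T) = (V kron U) vec(I)
    Sinv_C = np.linalg.solve(Sigma_mm, C)
    Sinv_g = np.linalg.solve(Sigma_mm, g)
    lam = np.linalg.solve(C.T @ Sinv_C, C.T @ Sinv_g)
    Lambda = lam.reshape(p - r, q - r, order="F")
    # vec(Delta) = -Sigma^{-1} vec(U V^T - U_perp Lambda V_perp^T)
    delta = -(Sinv_g - Sinv_C @ lam)
    return Lambda, delta.reshape(p, q, order="F")
```

One would then compare `np.linalg.norm(Lambda, 2)` with one. When $\Sigma_{mm}$ is the identity, $\Lambda$ vanishes, so the condition holds trivially in that case.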

C.8 Proof of Theorem 12 

Regular consistency is obtained by Corollary 5. We consider the problem in Eq. (5) of
Proposition 10, where $\lambda_n n^{1/2} \to \infty$ and $\lambda_n \to 0$. Since Eq. (8) is satisfied, the solution $\Delta$
indeed satisfies $\mathbf{U}_\perp^\top \Delta \mathbf{V}_\perp = 0$ by Lemma 11.

We have $\hat{W} = \mathbf{W} + \lambda_n \Delta + o_p(\lambda_n)$ and we now show that the optimality conditions
are satisfied with rank $r$. From the regular consistency, the rank of $\hat{W}$ is, with probability
tending to one, larger than $r$ (because the rank is a lower semi-continuous function). We now
need to show that it is actually equal to $r$. We let denote $\hat{W} = USV^\top$ the singular value
decomposition of $\hat{W}$ and we let denote $U_o$ and $V_o$ the singular vectors corresponding to all
but the $r$ largest singular values. Since we have simultaneous singular value decompositions,
we simply need to show that $\big\| U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o \big\|_2 < \lambda_n$ with probability
tending to one. We have:

$$U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o = U_o^\top \big( \lambda_n \Sigma_{mm} \Delta + o_p(\lambda_n) - O_p(n^{-1/2}) \big) V_o = \lambda_n U_o^\top (\Sigma_{mm} \Delta) V_o + o_p(\lambda_n).$$

Moreover, because of regular consistency and a positive eigengap for $\mathbf{W}$, the projection
onto the first $r$ singular vectors of $\hat{W}$ converges to the projection onto the first $r$ singular
vectors of $\mathbf{W}$ (see Appendix A), which implies that the projection onto the orthogonal
is also consistent, i.e., $U_o U_o^\top$ converges in probability to $\mathbf{U}_\perp \mathbf{U}_\perp^\top$ and $V_o V_o^\top$ converges in
probability to $\mathbf{V}_\perp \mathbf{V}_\perp^\top$. Thus:

$$\big\| U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o \big\|_2 = \big\| U_o U_o^\top \big( \hat{\Sigma}_{mm}(\hat{W} - \mathbf{W}) - \hat{\Sigma}_{M\varepsilon} \big) V_o V_o^\top \big\|_2$$

$$= \lambda_n \big\| \mathbf{U}_\perp \mathbf{U}_\perp^\top (\Sigma_{mm} \Delta) \mathbf{V}_\perp \mathbf{V}_\perp^\top \big\|_2 + o_p(\lambda_n) = \lambda_n \|\Lambda\|_2 + o_p(\lambda_n).$$

This implies that the last expression is asymptotically of magnitude strictly less than
$\lambda_n$ (since $\|\Lambda\|_2 < 1$), which concludes the proof.



C.9 Proof of Theorem 13 

We have seen earlier that if $n^{1/2} \lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, then Eq. (7) is
necessary for rank-consistency. We just have to show that there is a subsequence that does
satisfy this. If $\liminf \lambda_n > 0$, then we cannot have consistency (by Proposition 6), thus if
we consider a subsequence, we can always assume that $\lambda_n$ tends to zero.

We now consider the sequence $n^{1/2} \lambda_n$, and its accumulation points. If zero or $+\infty$ is
one of them, then by Propositions 7 and 9, we cannot have rank consistency. Thus, for all
accumulation points (which are finite and strictly positive), by considering a subsequence,
we are in the situation where $n^{1/2} \lambda_n$ tends to $+\infty$ and $\lambda_n$ tends to zero, which implies
Eq. (7), by definition of $\Lambda$ in Eq. (6) and Lemma 11.



C.10 Proof of Theorem 15 

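We let denote $U_{LS}^r$ and $V_{LS}^r$ the first $r$ columns of $U_{LS}$ and $V_{LS}$ and $U_{LS}^o$ and $V_{LS}^o$ the
remaining columns; we also denote $s_{LS}^r$ the corresponding first $r$ singular values and $s_{LS}^o$
the remaining singular values. From Lemma 14 and results in the appendix, we get that
$\|s_{LS}^r - s\|_2 = O_p(n^{-1/2})$ and $\|s_{LS}^o\|_2 = O_p(n^{-1/2})$ and $\|U_{LS}^r (U_{LS}^r)^\top - \mathbf{U}\mathbf{U}^\top\|_2 = O_p(n^{-1/2})$
and $\|V_{LS}^r (V_{LS}^r)^\top - \mathbf{V}\mathbf{V}^\top\|_2 = O_p(n^{-1/2})$. By writing $\hat{W}_A = \mathbf{W} + n^{-1/2} \hat{\Delta}_A$, $\hat{\Delta}_A$ is defined as
the minimum of

$$\frac{1}{2} \mathrm{vec}(\Delta)^\top \hat{\Sigma}_{mm} \mathrm{vec}(\Delta) - n^{1/2} \mathrm{tr}\, \Delta^\top \hat{\Sigma}_{M\varepsilon} + n \lambda_n \big( \|A \mathbf{W} B + n^{-1/2} A \Delta B\|_* - \|A \mathbf{W} B\|_* \big).$$

We have:

$$A \mathbf{U} = U_{LS} \mathrm{Diag}(s_{LS})^{-\gamma} U_{LS}^\top \mathbf{U}$$

$$= U_{LS}^r \mathrm{Diag}(s_{LS}^r)^{-\gamma} (U_{LS}^r)^\top \mathbf{U} + U_{LS}^o \mathrm{Diag}(s_{LS}^o)^{-\gamma} (U_{LS}^o)^\top \mathbf{U}$$

$$= \mathbf{U} \mathrm{Diag}(s)^{-\gamma} + O_p(n^{-1/2}) + O_p(n^{-1/2} n^{\gamma/2})$$

$$= \mathbf{U} \mathrm{Diag}(s)^{-\gamma} + O_p(n^{-1/2} n^{\gamma/2}),$$

and

$$A \mathbf{U}_\perp = U_{LS} \mathrm{Diag}(s_{LS})^{-\gamma} U_{LS}^\top \mathbf{U}_\perp$$

$$= U_{LS}^r \mathrm{Diag}(s_{LS}^r)^{-\gamma} (U_{LS}^r)^\top \mathbf{U}_\perp + U_{LS}^o \mathrm{Diag}(s_{LS}^o)^{-\gamma} (U_{LS}^o)^\top \mathbf{U}_\perp$$

$$= \mathbf{U}_\perp \mathrm{Diag}(s_{LS}^o)^{-\gamma} + O_p(n^{\gamma/2 - 1/2})$$

$$= O_p(n^{\gamma/2}).$$

Similarly we have: $B \mathbf{V} = \mathbf{V} \mathrm{Diag}(s)^{-\gamma} + O_p(n^{-1/2} n^{\gamma/2})$ and $B \mathbf{V}_\perp = O_p(n^{\gamma/2})$. We can
decompose any $\Delta \in \mathbb{R}^{p \times q}$ as $\Delta = (\mathbf{U}\ \mathbf{U}_\perp) \begin{pmatrix} \Delta_{rr} & \Delta_{ro} \\ \Delta_{or} & \Delta_{oo} \end{pmatrix} (\mathbf{V}\ \mathbf{V}_\perp)^\top$. We have assumed that
$\lambda_n n^{1/2} n^{\gamma/2}$ tends to infinity. Thus,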
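• if $\mathbf{U}_\perp^\top \Delta = 0$ and $\Delta \mathbf{V}_\perp = 0$ (i.e., if $\Delta$ is of the form $\mathbf{U} \Delta_{rr} \mathbf{V}^\top$),

$$n \lambda_n \big( \|A \mathbf{W} B + n^{-1/2} A \Delta B\|_* - \|A \mathbf{W} B\|_* \big) \leqslant \lambda_n n^{1/2} \|A \Delta B\|_*$$

$$= \lambda_n n^{1/2} \|\mathrm{Diag}(s)^{-\gamma} \Delta_{rr} \mathrm{Diag}(s)^{-\gamma}\|_* + O_p(\lambda_n n^{\gamma/2}) = O_p(\lambda_n n^{1/2})$$

tends to zero.

• Otherwise, $n \lambda_n \big( \|A \mathbf{W} B + n^{-1/2} A \Delta B\|_* - \|A \mathbf{W} B\|_* \big)$ is larger than $\lambda_n n^{1/2} \|A \Delta B\|_* -
2 \|A \mathbf{W} B\|_*$. The term $\|A \mathbf{W} B\|_*$ is bounded in probability because we can write
$A \mathbf{W} B = \mathbf{U} \mathrm{Diag}(s)^{1 - 2\gamma} \mathbf{V}^\top + O_p(n^{-1/2} n^{\gamma/2})$ and $\gamma \leqslant 1$. Besides, $\lambda_n n^{1/2} \|A \Delta B\|_*$ is
tending to infinity as soon as any of $\Delta_{oo}$, $\Delta_{or}$ or $\Delta_{ro}$ are different from zero. Indeed,
by equivalence of finite dimensional norms, $\lambda_n n^{1/2} \|A \Delta B\|_*$ is larger than a constant
times $\lambda_n n^{1/2} \|A \Delta B\|_F$, which can be decomposed in four pieces along $(\mathbf{U}, \mathbf{U}_\perp)$ and
$(\mathbf{V}, \mathbf{V}_\perp)$, corresponding asymptotically to $\Delta_{oo}$, $\Delta_{or}$, $\Delta_{ro}$ or $\Delta_{rr}$. The smallest of those
terms grows faster than $\lambda_n n^{1/2 + \gamma/2}$, and thus tends to infinity.

Thus, since $\Sigma_{mm}$ is invertible, by the epi-convergence theorem of Geyer (1994, 1996),
$\hat{\Delta}_A$ converges in distribution to the minimum of

$$\frac{1}{2} \mathrm{vec}(\Delta)^\top \Sigma_{mm} \mathrm{vec}(\Delta) - n^{1/2} \mathrm{tr}\, \Delta^\top \hat{\Sigma}_{M\varepsilon},$$

such that $\mathbf{U}_\perp^\top \Delta = 0$ and $\Delta \mathbf{V}_\perp = 0$. This minimum has a simple asymptotic distribution,
namely $\Delta = \mathbf{U} \Theta \mathbf{V}^\top$ and $\Theta$ is asymptotically normal with mean zero and covariance
matrix $\sigma^2 \big[ (\mathbf{V} \otimes \mathbf{U})^\top \Sigma_{mm} (\mathbf{V} \otimes \mathbf{U}) \big]^{-1}$, which leads to the consistency and the asymptotic
normality.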

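In order to finish the proof, we consider the optimality conditions which can be written
as $A \hat{\Delta}_A B$ and

$$A^{-1} \big( \hat{\Sigma}_{mm} \hat{\Delta}_A - n^{1/2} \hat{\Sigma}_{M\varepsilon} \big) B^{-1}$$

having simultaneous singular value decompositions with proper decays of singular values,
i.e., such that the first $r$ are equal to $\lambda_n n^{1/2}$ and the remaining ones are less than $\lambda_n n^{1/2}$.

From the asymptotic normality we get that $\hat{\Sigma}_{mm} \hat{\Delta}_A - n^{1/2} \hat{\Sigma}_{M\varepsilon}$ is $O_p(1)$; we can thus
consider matrices of the form $A^{-1} \Theta B^{-1}$ where $\Theta$ is bounded, the same way we considered
matrices of the form $A \Delta B$.

We have:

$$A^{-1} \mathbf{U} = U_{LS} \mathrm{Diag}(s_{LS})^{\gamma} U_{LS}^\top \mathbf{U}$$

$$= U_{LS}^r \mathrm{Diag}(s_{LS}^r)^{\gamma} (U_{LS}^r)^\top \mathbf{U} + U_{LS}^o \mathrm{Diag}(s_{LS}^o)^{\gamma} (U_{LS}^o)^\top \mathbf{U}$$

$$= \mathbf{U} \mathrm{Diag}(s)^{\gamma} + O_p(n^{-1/2}),$$

and

$$A^{-1} \mathbf{U}_\perp = U_{LS} \mathrm{Diag}(s_{LS})^{\gamma} U_{LS}^\top \mathbf{U}_\perp$$

$$= U_{LS}^r \mathrm{Diag}(s_{LS}^r)^{\gamma} (U_{LS}^r)^\top \mathbf{U}_\perp + U_{LS}^o \mathrm{Diag}(s_{LS}^o)^{\gamma} (U_{LS}^o)^\top \mathbf{U}_\perp$$

$$= O_p(n^{-1/2}) + \mathbf{U}_\perp \mathrm{Diag}(s_{LS}^o)^{\gamma},$$

with similar expansions for $B^{-1} \mathbf{V}$ and $B^{-1} \mathbf{V}_\perp$. We obtain the first order expansion:

$$A^{-1} \Theta B^{-1} = \mathbf{U} \mathrm{Diag}(s)^{\gamma} \Theta_{rr} \mathrm{Diag}(s)^{\gamma} \mathbf{V}^\top + \mathbf{U}_\perp \mathrm{Diag}(s_{LS}^o)^{\gamma} \Theta_{or} \mathrm{Diag}(s)^{\gamma} \mathbf{V}^\top$$

$$+ \ \mathbf{U} \mathrm{Diag}(s)^{\gamma} \Theta_{ro} \mathrm{Diag}(s_{LS}^o)^{\gamma} \mathbf{V}_\perp^\top + \mathbf{U}_\perp \mathrm{Diag}(s_{LS}^o)^{\gamma} \Theta_{oo} \mathrm{Diag}(s_{LS}^o)^{\gamma} \mathbf{V}_\perp^\top.$$

Because of the regular consistency, the first term is of the order of $\lambda_n n^{1/2}$ (so that the
first $r$ singular values of $\hat{W}$ are strictly positive), while the three other terms have norms
less than $O_p(n^{-\gamma/2})$, which is less than $O_p(n^{1/2} \lambda_n)$ by assumption. This concludes the proof.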